[00:00:51] (PS2) Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox [puppet] - https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807)
[00:01:11] ah, interesting! I wonder if there's an opportunity to make that more sophisticated, then ... i.e., separate from the question of why the l10n updates are happening (and in turn bumping mtimes) at all
[00:03:24] I'm still confirming some details of this, my main piece of evidence is that I made two copies of a file, ran rsync -n, touched the target and ran rsync -n again
[00:03:56] and after touching, rsync -n reported an increased "Total transferred file size"
[00:03:56] (Abandoned) Aaron Schulz: [DNM] rest-gateway: map restbase sandbox URLs to Special:RestSandbox/wmf-restbase [deployment-charts] - https://gerrit.wikimedia.org/r/1190753 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[00:04:11] I can confirm that a small change to an l10n input file can result in a large change to the corresponding .cdb file
[00:04:24] (Abandoned) Aaron Schulz: restgateway: make spec-json-wikimedia catch non-www domain too [deployment-charts] - https://gerrit.wikimedia.org/r/1202323 (https://phabricator.wikimedia.org/T396805) (owner: Aaron Schulz)
[00:05:02] a CDB file has three seconds: pointers, records and hashtables
[00:05:08] s/seconds/sections
[00:05:30] the pointers and hashtables will all change but the records should be mostly constant as long as they were written in the same order
[00:05:34] but I'm still confirming that
[00:05:50] I believe they're written in insertion order.
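Editor's illustration of the three CDB sections mentioned above. This is a minimal sketch that follows the published cdb file format (a 2048-byte pointer header, then records in insertion order, then hash tables); it is not the code MediaWiki or scap actually uses, and the function names and sample messages are hypothetical. It builds two small files that differ in one value and compares each section separately.

```python
import struct


def cdb_hash(key: bytes) -> int:
    """The cdb hash: start at 5381, then h = (h * 33) XOR byte, kept to 32 bits."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


def build_cdb(items):
    """Return the (pointers, records, hashtables) sections for (key, value) pairs."""
    records = bytearray()
    buckets = [[] for _ in range(256)]
    pos = 2048  # records start right after the 256 * 8 byte pointer header
    for key, value in items:
        rec = struct.pack("<II", len(key), len(value)) + key + value
        h = cdb_hash(key)
        buckets[h & 0xFF].append((h, pos))
        records += rec
        pos += len(rec)

    pointers = bytearray()
    tables = bytearray()
    for bucket in buckets:
        nslots = 2 * len(bucket)
        # Each header entry holds the absolute offset and slot count of one table.
        pointers += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, rec_pos in bucket:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing; offset 0 marks an empty slot
                i = (i + 1) % nslots
            slots[i] = (h, rec_pos)
        for h, rec_pos in slots:
            tables += struct.pack("<II", h, rec_pos)
        pos += 8 * nslots
    return bytes(pointers), bytes(records), bytes(tables)


def common_prefix_suffix(a: bytes, b: bytes):
    """Lengths of the shared prefix and the (non-overlapping) shared suffix."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix, suffix


# Hypothetical message set: same keys, same insertion order, one value changed.
old = [(f"msg-{i}".encode(), f"text {i}".encode()) for i in range(100)]
new = list(old)
new[10] = (old[10][0], b"a longer replacement text")

p1, r1, t1 = build_cdb(old)
p2, r2, t2 = build_cdb(new)

changed_pointers = sum(p1[i:i + 8] != p2[i:i + 8] for i in range(0, 2048, 8))
prefix, suffix = common_prefix_suffix(r1, r2)
print("header entries that changed:", changed_pointers, "of 256")
print("record bytes shared as prefix/suffix:", prefix + suffix, "of", len(r1))
```

The record section only differs around the edited entry, but every later offset shifts, which is why a positional comparison (or a whole-file copy) sees the file as largely changed.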
[00:07:11] (PS3) Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox [puppet] - https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807)
[00:07:19] swfrench-wmf: For bare metal deployment this is optimized by converting the CDB files to JSON and rsyncing only the JSON files, and reconstituting on the target hosts.
[00:09:40] dancy: ah, so the JSON intermediate representation kinda fans out the changes? (i.e., rather than the whole CDB file getting shipped, a much smaller subset of the JSON files it fans out to do?)
[00:11:01] Yeah.
[00:11:42] * swfrench-wmf thumbs up
[00:14:05] FIRING: [4x] SystemdUnitFailed: docker-registry.service on registry1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:55] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1070:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:36] SRE, envoy, serviceops: Envoy config updates from v1.32 - https://phabricator.wikimedia.org/T409510 (RLazarus) NEW
[00:22:55] RESOLVED: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1070:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:28:31] docker layers work at a file granularity anyway don't they? so the fact that the files are similar in some sense doesn't help us
[00:30:30] if so the JSON thing wouldn't help either
[00:34:10] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:55] RESOLVED: [9x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:03] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1202878
[00:39:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1202878 (owner: TrainBranchBot)
[00:44:09] (PS1) RLazarus: mesh: Bump requirement to 1.15 [deployment-charts] - https://gerrit.wikimedia.org/r/1202879 (https://phabricator.wikimedia.org/T409183)
[00:47:23] if you run rsync on local files, the default is --whole-file, you have to run with --no-whole-file if you want it to do its rolling block checksum thing
[00:48:55] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:41] (CR) Scott French: [C:+1] mesh: Bump requirement to 1.15 [deployment-charts] - https://gerrit.wikimedia.org/r/1202879 (https://phabricator.wikimedia.org/T409183) (owner: RLazarus)
[00:50:16] -n also disables it
[00:51:15] (CR) RLazarus: [C:+2] "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1202879 (https://phabricator.wikimedia.org/T409183) (owner: RLazarus)
[00:53:01] (Merged) jenkins-bot: mesh: Bump requirement to 1.15 [deployment-charts] - https://gerrit.wikimedia.org/r/1202879 (https://phabricator.wikimedia.org/T409183) (owner: RLazarus)
[00:53:16] the cdb->json->rsync->cdb dance in the scap code for bare metal was an optimization that A.aronSchulz came up with in 2013/2014. The binary CDB files have really large rsync deltas, but the json dumps are relatively well behaved under rsync.
[00:53:55] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:55:36] (PS1) RLazarus: mesh.configuration: Copy 1.15.0 to 1.15.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1202880 (https://phabricator.wikimedia.org/T409510)
[00:55:38] (PS1) RLazarus: mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510)
[00:56:09] if I understand correctly, we can disable it now for a 3.4GB saving in image size per MW version
[00:56:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1202878 (owner: TrainBranchBot)
[00:56:37] that's the size of /srv/mediawiki-staging/php-1.46.0-wmf.1/cache/l10n/upstream/*.cdb.json
[00:56:47] the containers shouldn't need the json files for sure
[00:57:20] beta cluster still uses the bare metal flow, so they are useful there
[00:57:24] (CR) CI reject: [V:-1] mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510) (owner: RLazarus)
[00:58:55] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:00:09] we should do the numbers on that
[01:00:56] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:03:10] (PS2) RLazarus: mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510)
[01:03:23] The JSON form on the l10n files is not included in the images
[01:03:35] *of the
[01:03:55] RESOLVED: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:04:13] * bd808 closes the half written feature request
[01:05:04] thanks dancy, sorry, evidence of that is in my backscroll to be honest
[01:05:40] In that case, definitely file a bug!
[01:05:55] no you are right the files are not there
[01:06:21] so, still no ideas for making scap fast
[01:07:46] getting the 5 minute sleep out of the image upload would help on the big ones. At least I think that is still in there somewhere.
[01:08:02] Sadly it is
[01:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:09:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1202883
[01:09:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1202883 (owner: TrainBranchBot)
[01:10:23] idea: try harder to not update the l10n files on unrelated changes
[01:10:34] It looks like it has been a while since anyone tilted at T99740
[01:10:37] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740
[01:11:43] it's not the data format, it's the fact that they're merged, so any change to the english localisation causes a change to all other localisations that don't override that message
[01:12:00] if they're PHP you have the same problem
[01:12:29] you can merge at runtime but it's very hot, there would be a performance impact
[01:12:39] right. that's the bit about file level diff vs byte level diff you pointed out before
[01:13:07] I'm off for the weekend. Good luck yall!
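Editor's illustration of the merging point made above. This is a deliberately simplified sketch: it assumes every language falls back directly to English, whereas real fallback chains can be longer, and the message keys, languages, and function names are hypothetical. It shows why one English edit changes the build-time merged cache of every language that does not override that message.

```python
def merged_cache(overrides, english):
    """Build-time merge: start from English, then apply the language's own messages."""
    merged = dict(english)
    merged.update(overrides)
    return merged


def changed_languages(overrides_by_lang, english_before, english_after):
    """Languages whose merged cache differs after an English-only change."""
    changed = []
    for lang, overrides in overrides_by_lang.items():
        before = merged_cache(overrides, english_before)
        after = merged_cache(overrides, english_after)
        if before != after:
            changed.append(lang)
    return sorted(changed)


# Hypothetical data: only "de" translates the "cancel" message.
english_before = {"save": "Save", "cancel": "Cancel", "publish": "Publish"}
english_after = dict(english_before, cancel="Cancel edit")
overrides_by_lang = {
    "en": {},
    "de": {"save": "Speichern", "cancel": "Abbrechen"},
    "fr": {"save": "Enregistrer"},
    "nl": {},
}

print(changed_languages(overrides_by_lang, english_before, english_after))
```

In this toy data, only the language that overrides the edited message keeps an identical merged cache; the storage format (CDB or PHP arrays) does not enter into it.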
[01:13:16] bye
[01:13:25] rsync is being misused to estimate the size of a docker layer, so --whole-file is correct for that
[01:14:30] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 34s)
[01:15:55] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:20:55] RESOLVED: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:31:55] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:35:06] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1202883 (owner: TrainBranchBot)
[01:36:55] RESOLVED: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:38:21] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:55] FIRING: [6x] SystemdUnitFailed: push_cross_cluster_settings_9600.service on cirrussearch1073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:08:55] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:11:18] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch1068 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:13:55] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:18:55] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:19:28] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1094 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:21:18] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch1068 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:25:18] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1095 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:28:55] FIRING: [9x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:29:28] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1094 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:29:58] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1074 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:30:22] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1075 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:33:55] FIRING: [9x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:18] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1095 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:37:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Hurricane Electric (2001:7f8:54:5::13) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[02:38:55] RESOLVED: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:58] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1074 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:40:22] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1075 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:43:55] FIRING: [17x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:50:02] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1081 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:50:50] FIRING: DiskSpace: Disk space install1005:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:53:44] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202368 (https://phabricator.wikimedia.org/T407127) (owner: Tim Starling)
[02:53:45] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202369 (https://phabricator.wikimedia.org/T407127) (owner: Tim Starling)
[02:53:55] FIRING: [14x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:54:35] (Merged) jenkins-bot: Add English translations to namespaces that lack them [mediawiki-config] - https://gerrit.wikimedia.org/r/1202368 (https://phabricator.wikimedia.org/T407127) (owner: Tim Starling)
[02:54:37] (Merged) jenkins-bot: Set robot noindex policy for draft namespaces that lacked it [mediawiki-config] - https://gerrit.wikimedia.org/r/1202369 (https://phabricator.wikimedia.org/T407127) (owner: Tim Starling)
[02:55:10] FIRING: [14x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:55:15] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1202368|Add English translations to namespaces that lack them (T407127)]], [[gerrit:1202369|Set robot noindex policy for draft namespaces that lacked it (T407127)]]
[02:55:18] T407127: [WE5.2.5 Milestone] Limit returned namespaces in default sitemap response - https://phabricator.wikimedia.org/T407127
[02:57:50] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1202368|Add English translations to namespaces that lack them (T407127)]], [[gerrit:1202369|Set robot noindex policy for draft namespaces that lacked it (T407127)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:58:08] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1118 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:58:43] !log tstarling@deploy2002 tstarling: Continuing with sync
[02:58:55] FIRING: [15x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:02] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1081 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:03:55] RESOLVED: [10x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:13] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202368|Add English translations to namespaces that lack them (T407127)]], [[gerrit:1202369|Set robot noindex policy for draft namespaces that lacked it (T407127)]] (duration: 09m 58s)
[03:05:17] T407127: [WE5.2.5 Milestone] Limit returned namespaces in default sitemap response - https://phabricator.wikimedia.org/T407127
[03:06:53] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot (apply updates) - ryankemper@cumin1002 - T390860
[03:06:57] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[03:08:08] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch1118 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:47:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:47:54] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:49:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on install1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:14:12] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:17:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:17:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:33:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:49:19] (PS1) Kevin Bazira: ml-services: update revertrisk-wikidata isvc in experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1202908 (https://phabricator.wikimedia.org/T406179)
[06:06:03] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1202688 (owner: L10n-bot)
[06:14:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[06:17:28] (PS1) Marostegui: site.pp: Update s6 note [puppet] - https://gerrit.wikimedia.org/r/1202910
[06:18:25] (CR) Marostegui: "This is a NOOP" [puppet] - https://gerrit.wikimedia.org/r/1202910 (owner: Marostegui)
[06:18:26] (CR) Marostegui: [C:+2] site.pp: Update s6 note [puppet] - https://gerrit.wikimedia.org/r/1202910 (owner: Marostegui)
[06:20:33] (CR) Marostegui: [C:-1] "It already has that line, the host is migrated." [puppet] - https://gerrit.wikimedia.org/r/1202773 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto)
[06:20:44] (CR) Marostegui: [C:+1] db2164: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1202774 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto)
[06:21:11] (CR) Marostegui: [C:+1] db2166: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1202775 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto)
[06:21:19] (CR) Marostegui: [C:+1] db2167: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1202776 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto)
[06:21:27] (CR) Marostegui: [C:+1] db2181: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1202777 (https://phabricator.wikimedia.org/T406008) (owner: Federico Ceratto)
[06:35:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance
[06:47:25] (PS1) Giuseppe Lavagetto: Bugfix release [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1202911
[06:48:28] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Bugfix release [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1202911 (owner: Giuseppe Lavagetto)
[06:49:39] (PS1) Marostegui: installserver: Do not format db1260 [puppet] - https://gerrit.wikimedia.org/r/1202912
[06:50:50] FIRING: DiskSpace: Disk space install1005:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:51:36] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Rate-limit by wmfuniq fix, conftool 6 - oblivian@cumin1003"
[06:51:39] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit by wmfuniq fix, conftool 6 - oblivian@cumin1003
[06:51:51] (CR) Marostegui: [C:+2] installserver: Do not format db1260 [puppet] - https://gerrit.wikimedia.org/r/1202912 (owner: Marostegui)
[06:52:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2145.codfw.wmnet with reason: Maintenance
[06:52:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T407997)', diff saved to https://phabricator.wikimedia.org/P85052 and previous config saved to /var/cache/conftool/dbconfig/20251107-065226-marostegui.json
[06:52:30] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[06:52:30] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit by wmfuniq fix, conftool 6 - oblivian@cumin1003
[06:52:32] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Rate-limit by wmfuniq fix, conftool 6 - oblivian@cumin1003"
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251107T0700)
[07:08:29] (PS1) Marostegui: db1262: Add note about memory issues [puppet] - https://gerrit.wikimedia.org/r/1202913 (https://phabricator.wikimedia.org/T409374)
[07:09:14] (CR) Marostegui: [C:+2] db1262: Add note about memory issues [puppet] - https://gerrit.wikimedia.org/r/1202913 (https://phabricator.wikimedia.org/T409374) (owner: Marostegui)
[07:09:30] (CR) Marostegui: [C:+2] "This is a NOOP" [puppet] - https://gerrit.wikimedia.org/r/1202913 (https://phabricator.wikimedia.org/T409374) (owner: Marostegui)
[07:12:04] (CR) Jcrespo: [C:+2] dbbackups: Remove dbprov1003 & dbprov2003 role and set them "insetup" [puppet] - https://gerrit.wikimedia.org/r/1202754 (https://phabricator.wikimedia.org/T403166) (owner: Jcrespo)
[07:13:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T407997)', diff saved to https://phabricator.wikimedia.org/P85053 and previous config saved to /var/cache/conftool/dbconfig/20251107-071310-marostegui.json
[07:13:14] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[07:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:27:42] !log fix failed logrotation on install1005
[07:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P85054 and previous config saved to /var/cache/conftool/dbconfig/20251107-072818-marostegui.json
[07:29:37] (CR) Slyngshede: [C:+1] "Looks good" [alerts] - https://gerrit.wikimedia.org/r/1202696 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff)
[07:29:58] (CR) Muehlenhoff: [C:+2] ganeti-ca: Warn after 90 days [alerts] - https://gerrit.wikimedia.org/r/1202696 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff)
[07:30:35] RESOLVED: DiskSpace: Disk space install1005:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=install1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:34:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on install1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:50] SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11352516 (elukey) editcheck's metrics seem to lead to: ` execution: found duplicate series for the match group {sloth_id="edit-check-edit-check-pre-save-checks-ratio"} on the right hand-side of the operation...
[07:43:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P85055 and previous config saved to /var/cache/conftool/dbconfig/20251107-074326-marostegui.json [07:45:27] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts dbprov2003.codfw.wmnet [07:48:34] (03PS1) 10Jcrespo: site.pp: Remove last references to dbprov2003 (decommissioned) [puppet] - 10https://gerrit.wikimedia.org/r/1202915 (https://phabricator.wikimedia.org/T409524) [07:49:47] (03PS1) 10Jcrespo: site.pp: Remove last references to dbprov1003 (decommissioned) [puppet] - 10https://gerrit.wikimedia.org/r/1202916 (https://phabricator.wikimedia.org/T409524) [07:50:15] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [07:50:47] (03CR) 10Kosta Harlan: EventBus: Enable TYPE_EVENT for loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: 10Kosta Harlan) [07:52:22] (03PS1) 10Muehlenhoff: preseed: Configure es2028 with db.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1202918 (https://phabricator.wikimedia.org/T408777) [07:58:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T407997)', diff saved to https://phabricator.wikimedia.org/P85056 and previous config saved to /var/cache/conftool/dbconfig/20251107-075833-marostegui.json [07:58:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:58:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance [07:58:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T407997)', diff saved to https://phabricator.wikimedia.org/P85057 and previous config saved to /var/cache/conftool/dbconfig/20251107-075857-marostegui.json [07:59:33] !log jynus@cumin1003 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251107T0800) [08:00:20] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [08:00:21] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:00:21] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbprov2003.codfw.wmnet [08:01:45] (03CR) 10Marostegui: [C:03+1] "This will erase the /srv command, the issue also happens with a normal installation (where we do not format it too)" [puppet] - 10https://gerrit.wikimedia.org/r/1202918 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [08:02:00] (03CR) 10Marostegui: [C:03+1] "https://phabricator.wikimedia.org/T408777#11327045" [puppet] - 10https://gerrit.wikimedia.org/r/1202918 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [08:02:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [08:02:56] checking [08:03:38] 06SRE, 07SRE-Unowned, 10Maps: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528 (10MoritzMuehlenhoff) 03NEW [08:05:58] (03CR) 10Jcrespo: [C:03+2] site.pp: Remove last references to dbprov2003 (decommissioned) [puppet] - 10https://gerrit.wikimedia.org/r/1202915 (https://phabricator.wikimedia.org/T409524) (owner: 10Jcrespo) [08:06:46] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2001.codfw.wmnet with OS bookworm [08:07:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [08:13:13] oh, I didn't notice it [08:18:31] 06SRE, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529 (10MoritzMuehlenhoff) 03NEW [08:18:41] (03Abandoned) 10Aklapper: Update funneling to invalid https://wikimediafoundation.org/zh/ [puppet] - 10https://gerrit.wikimedia.org/r/1201689 (https://phabricator.wikimedia.org/T407579) (owner: 10Aklapper) [08:18:41] 06SRE, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11352600 (10MoritzMuehlenhoff) [08:18:49] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11352601 (10MoritzMuehlenhoff) [08:19:27] (03CR) 10Brouberol: [C:03+2] airflow: define a network policy specific to task pods requiring egress to a proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202186 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [08:19:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T407997)', diff saved to https://phabricator.wikimedia.org/P85058 and previous config saved to /var/cache/conftool/dbconfig/20251107-081934-marostegui.json [08:19:39] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:21:56] (03Merged) 10jenkins-bot: airflow: define a network policy specific to task pods requiring egress to a proxy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1202186 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [08:24:13] (03Abandoned) 10Federico Ceratto: db2152: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202773 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [08:25:22] (03PS2) 10Federico Ceratto: db2164: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202774 (https://phabricator.wikimedia.org/T406008) [08:25:22] (03PS2) 10Federico Ceratto: db2166: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202775 (https://phabricator.wikimedia.org/T406008) [08:25:22] (03PS2) 10Federico Ceratto: db2167: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202776 (https://phabricator.wikimedia.org/T406008) [08:25:22] (03PS2) 10Federico Ceratto: db2181: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202777 (https://phabricator.wikimedia.org/T406008) [08:25:43] (03CR) 10Brouberol: [C:03+2] Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202106 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [08:26:17] (03CR) 10Federico Ceratto: [C:03+2] db2164: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202774 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [08:27:06] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [08:27:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2164 - Upgrading db2164.codfw.wmnet [08:27:58] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2164 - Upgrading db2164.codfw.wmnet [08:28:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [08:28:25] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts dbprov1003.eqiad.wmnet [08:29:20] !log 
brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:31:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:32:01] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [08:34:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [08:34:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P85060 and previous config saved to /var/cache/conftool/dbconfig/20251107-083442-marostegui.json [08:35:07] (03CR) 10Muehlenhoff: [C:03+2] preseed: Configure es2028 with db.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1202918 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [08:37:42] jynus@cumin1003 decommission (PID 782434) is awaiting input [08:40:52] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [08:43:56] jynus@cumin1003 decommission (PID 782434) is awaiting input [08:44:28] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1003" [08:44:28] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:44:30] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbprov1003.eqiad.wmnet [08:48:56] (03PS2) 10Brouberol: growthbook: enable email sending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201084 (https://phabricator.wikimedia.org/T408904) [08:49:38] (03PS3) 10Brouberol: growthbook: enable email sending [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1201084 (https://phabricator.wikimedia.org/T408904) [08:49:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P85061 and previous config saved to /var/cache/conftool/dbconfig/20251107-084949-marostegui.json [08:49:53] (03CR) 10Jcrespo: [C:03+2] site.pp: Remove last references to dbprov1003 (decommissioned) [puppet] - 10https://gerrit.wikimedia.org/r/1202916 (https://phabricator.wikimedia.org/T409524) (owner: 10Jcrespo) [08:50:44] (03PS4) 10Brouberol: growthbook: enable email sending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201084 (https://phabricator.wikimedia.org/T408904) [08:56:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2001.codfw.wmnet with OS bookworm [08:57:21] (03CR) 10Brouberol: [C:03+2] growthbook: enable email sending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201084 (https://phabricator.wikimedia.org/T408904) (owner: 10Brouberol) [08:59:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [08:59:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:02:33] (03CR) 10Filippo Giunchedi: [C:03+1] Uninstall intel-microcode on VMs [puppet] - 10https://gerrit.wikimedia.org/r/1201687 (owner: 10Muehlenhoff) [09:02:40] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission dbprov1003 - https://phabricator.wikimedia.org/T409524#11352757 (10jcrespo) [09:02:45] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission dbprov1003 - https://phabricator.wikimedia.org/T409524#11352759 (10jcrespo) [09:03:24] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission dbprov2003 - https://phabricator.wikimedia.org/T409525#11352760 (10jcrespo) [09:04:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T407997)', diff saved 
to https://phabricator.wikimedia.org/P85062 and previous config saved to /var/cache/conftool/dbconfig/20251107-090457-marostegui.json [09:05:01] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:05:09] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission dbprov2003 - https://phabricator.wikimedia.org/T409525#11352768 (10jcrespo) [09:05:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance [09:05:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T407997)', diff saved to https://phabricator.wikimedia.org/P85063 and previous config saved to /var/cache/conftool/dbconfig/20251107-090521-marostegui.json [09:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:40] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11352776 (10ItamarWMDE) Thank you everyone! 
[09:12:21] (03CR) 10Elukey: [C:03+1] osm_master: Remove support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:17:25] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2164 gradually with 4 steps - Migration of db2164.codfw.wmnet completed [09:17:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:19:53] (03PS1) 10Brouberol: growthbook: fix the smtp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202981 (https://phabricator.wikimedia.org/T408904) [09:21:52] (03CR) 10Brouberol: [C:03+2] growthbook: fix the smtp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202981 (https://phabricator.wikimedia.org/T408904) (owner: 10Brouberol) [09:23:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:23:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:25:12] (03PS2) 10Brouberol: growthbook: define public configuration for s3 file uploads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) [09:25:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T407997)', diff saved to https://phabricator.wikimedia.org/P85065 and previous config saved to /var/cache/conftool/dbconfig/20251107-092539-marostegui.json [09:25:44] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:25:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:26:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/growthbook: apply [09:39:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P85067 and previous config saved to /var/cache/conftool/dbconfig/20251107-094047-marostegui.json [09:50:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:51:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:51:32] (03CR) 10Brouberol: growthbook: define public configuration for s3 file uploads (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol) [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [09:54:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [09:55:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P85069 and previous config saved to /var/cache/conftool/dbconfig/20251107-095555-marostegui.json [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST 
certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:02:54] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2164 gradually with 4 steps - Migration of db2164.codfw.wmnet completed [10:02:55] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:07:32] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps-test2002.codfw.wmnet [10:09:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [10:10:36] (03PS3) 10Abijeet Patro: Remove SpecialContributeSkinsEnabled for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) [10:11:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T407997)', diff saved to https://phabricator.wikimedia.org/P85071 and previous config saved to /var/cache/conftool/dbconfig/20251107-101102-marostegui.json [10:11:07] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:11:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [10:11:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T407997)', diff saved to https://phabricator.wikimedia.org/P85072 and previous config saved to /var/cache/conftool/dbconfig/20251107-101126-marostegui.json [10:13:07] (03CR) 10Abijeet Patro: "Done, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [10:14:08] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:18:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:20:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:20:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:20:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps-test2002.codfw.wmnet [10:20:29] 06SRE, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11352870 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps-test2002.codfw.wmnet` - maps-test2002.codfw.... 
[10:20:40] (03PS1) 10Esanders: Freeze LiquidThreads on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) [10:21:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [10:21:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [10:22:32] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps-test2003.codfw.wmnet [10:26:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:26:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:27:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:27:08] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:27:09] (03PS1) 10Slyngshede: P:cache::base allow geoip to be disabled [puppet] - 10https://gerrit.wikimedia.org/r/1202986 [10:27:20] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:27:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7570/console" [puppet] - 10https://gerrit.wikimedia.org/r/1202986 (owner: 10Slyngshede) [10:27:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/growthbook: apply [10:29:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:29:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:29:32] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7571/console" [puppet] - 10https://gerrit.wikimedia.org/r/1202986 (owner: 10Slyngshede) [10:29:40] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2090.codfw.wmnet with OS bullseye [10:29:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11352906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2090.codfw.wmnet with OS bullseye execute... [10:30:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:31:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:31:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T407997)', diff saved to https://phabricator.wikimedia.org/P85073 and previous config saved to /var/cache/conftool/dbconfig/20251107-103149-marostegui.json [10:31:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:31:54] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:32:12] (03CR) 10Slyngshede: [V:03+1] "Once merged we add profile::cache::base::use_geo_ip: false to the Puppet configuration in cloud for the cache hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1202986 (owner: 10Slyngshede) [10:32:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:33:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:33:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:34:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:35:09] jmm@cumin2002 decommission (PID 411475) is awaiting input [10:35:21] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2091.codfw.wmnet with OS bullseye [10:35:29] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11352923 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2091.codfw.wmnet with OS bullseye execute... 
[10:35:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:36:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:36:10] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Enable built-in Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1202989 (https://phabricator.wikimedia.org/T343885) [10:37:34] (03CR) 10Filippo Giunchedi: [V:03+1] "In light of https://phabricator.wikimedia.org/T409294#11351077 I don't think we need this" [puppet] - 10https://gerrit.wikimedia.org/r/1202382 (https://phabricator.wikimedia.org/T409294) (owner: 10Filippo Giunchedi) [10:37:39] FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:38:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:38:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:38:59] jmm@cumin2002 decommission (PID 411475) is awaiting input [10:39:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1003 - https://phabricator.wikimedia.org/T409524#11352946 (10jcrespo) I got an exception on host decommissioning script run, FYI: ` PASS |██████████████████████████████████████████████████████████████████████... [10:39:36] (03PS1) 10Tiziano Fogli: metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1202990 (https://phabricator.wikimedia.org/T397003) [10:39:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:39:42] (03PS1) 10Majavah: Add dummy toolviews hash salt [labs/private] - 10https://gerrit.wikimedia.org/r/1202991 [10:40:01] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1202990 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:40:06] (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1202990 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:40:40] (03CR) 10Majavah: [V:03+2 C:03+2] Add dummy toolviews hash salt [labs/private] - 10https://gerrit.wikimedia.org/r/1202991 (owner: 10Majavah) [10:42:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:43:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:43:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:43:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps-test2003.codfw.wmnet [10:43:11] 
06SRE, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11352977 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps-test2003.codfw.wmnet` - maps-test2003.codfw.... [10:43:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:43:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:43:45] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7573/co" [puppet] - 10https://gerrit.wikimedia.org/r/1202989 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah) [10:43:54] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2093.codfw.wmnet with OS bullseye [10:44:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:44:07] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11352986 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2093.codfw.wmnet with OS bullseye execute... 
[10:45:40] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps-test2004.codfw.wmnet [10:45:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:46:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:46:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P85074 and previous config saved to /var/cache/conftool/dbconfig/20251107-104657-marostegui.json [10:49:18] jmm@cumin2002 decommission (PID 420350) is awaiting input [10:50:05] (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Enable built-in Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1202989 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah) [10:50:21] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::k8s::haproxy: Enable built-in Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1202989 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah) [10:50:56] (03PS3) 10Brouberol: growthbook: define configuration for local file uploads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) [10:51:08] (03CR) 10Brouberol: growthbook: define configuration for local file uploads (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol) [10:51:56] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Remove IPv4-only monitoring override [puppet] - 10https://gerrit.wikimedia.org/r/1202993 [10:52:59] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7574/console" [puppet] - 10https://gerrit.wikimedia.org/r/1202993 (owner: 10Majavah) [10:53:25] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:53:53] (03Abandoned) 10Filippo 
Giunchedi: cloudceph: adjust mtu on cluster interface for single-nic [puppet] - 10https://gerrit.wikimedia.org/r/1202382 (https://phabricator.wikimedia.org/T409294) (owner: 10Filippo Giunchedi) [10:56:50] (03PS1) 10Vgutierrez: secrets: Mock of trusted_proxies.map [labs/private] - 10https://gerrit.wikimedia.org/r/1202994 [10:56:57] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:57:11] (03PS4) 10Brouberol: growthbook: define configuration for local file uploads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) [10:57:11] (03PS1) 10Brouberol: dse-k8s-eqiad: enable ceph-csi-cephfs in the growthbook namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202995 [11:00:02] jmm@cumin2002 decommission (PID 420350) is awaiting input [11:02:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P85075 and previous config saved to /var/cache/conftool/dbconfig/20251107-110204-marostegui.json [11:02:18] jmm@cumin2002 reimage (PID 406428) is awaiting input [11:03:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:03:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:03:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps-test2004.codfw.wmnet [11:03:56] 06SRE, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11353051 (10ops-monitoring-bot) 
cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps-test2004.codfw.wmnet` - maps-test2004.codfw.... [11:05:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [11:07:08] (03PS1) 10Majavah: hieradata: Remove obsolete haproxy_exporter settings [puppet] - 10https://gerrit.wikimedia.org/r/1202997 (https://phabricator.wikimedia.org/T343885) [11:07:09] (03CR) 10Vgutierrez: [V:03+2 C:03+2] secrets: Mock of trusted_proxies.map [labs/private] - 10https://gerrit.wikimedia.org/r/1202994 (owner: 10Vgutierrez) [11:10:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [11:10:36] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps-test2005.codfw.wmnet [11:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:15:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:17:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T407997)', diff saved to https://phabricator.wikimedia.org/P85076 and previous config saved to /var/cache/conftool/dbconfig/20251107-111712-marostegui.json [11:17:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:17:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [11:17:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T407997)', diff saved to https://phabricator.wikimedia.org/P85077 and previous config saved to /var/cache/conftool/dbconfig/20251107-111737-marostegui.json [11:20:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:23:20] jmm@cumin2002 decommission (PID 429800) is awaiting input [11:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:29:30] (03CR) 10Giuseppe Lavagetto: [C:03+1] Remove nlwiki exception from thumb limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup) [11:34:31] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen: Deploying v1.1.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202772 (https://phabricator.wikimedia.org/T404458) (owner: 10Clare Ming) [11:36:11] (03Merged) 10jenkins-bot: Test Kitchen: Deploying v1.1.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202772 (https://phabricator.wikimedia.org/T404458) (owner: 10Clare Ming) [11:38:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T407997)', diff saved to https://phabricator.wikimedia.org/P85078 and previous config saved to /var/cache/conftool/dbconfig/20251107-113801-marostegui.json [11:38:05] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:51:53] (03CR) 10Daniel Kinzler: [C:04-1] api-geteway: rename symbols used in restgw ratelimiter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [11:53:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P85079 
and previous config saved to /var/cache/conftool/dbconfig/20251107-115309-marostegui.json [11:56:21] (03PS1) 10Daniel Kinzler: rest-gateway: define catch-all rate limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202998 (https://phabricator.wikimedia.org/T409543) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251107T0800) [12:00:05] jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251107T1200). [12:01:23] jmm@cumin2002 reimage (PID 429566) is awaiting input [12:05:53] (03PS1) 10Majavah: interface: Add wrapper for specifying the MTU [puppet] - 10https://gerrit.wikimedia.org/r/1203000 [12:06:22] (03CR) 10CI reject: [V:04-1] interface: Add wrapper for specifying the MTU [puppet] - 10https://gerrit.wikimedia.org/r/1203000 (owner: 10Majavah) [12:08:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P85081 and previous config saved to /var/cache/conftool/dbconfig/20251107-120816-marostegui.json [12:10:12] (03PS2) 10Majavah: interface: Add wrapper for specifying the MTU [puppet] - 10https://gerrit.wikimedia.org/r/1203000 [12:10:42] (03CR) 10CI reject: [V:04-1] interface: Add wrapper for specifying the MTU [puppet] - 10https://gerrit.wikimedia.org/r/1203000 (owner: 10Majavah) [12:11:42] (03PS3) 10Majavah: interface: Add wrapper for specifying the MTU [puppet] - 10https://gerrit.wikimedia.org/r/1203000 [12:14:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:14:29] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [12:14:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps-test2005.codfw.wmnet [12:14:40] 06SRE, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11353234 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps-test2005.codfw.wmnet` - maps-test2005.codfw.... [12:17:30] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7575/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203000 (owner: 10Majavah) [12:20:29] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: Support enabling jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1203003 (https://phabricator.wikimedia.org/T330075) [12:22:16] (03PS2) 10Majavah: P:wmcs::cloud_private_subnet: Support enabling jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1203003 (https://phabricator.wikimedia.org/T330075) [12:23:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T407997)', diff saved to https://phabricator.wikimedia.org/P85082 and previous config saved to /var/cache/conftool/dbconfig/20251107-122324-marostegui.json [12:23:28] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:23:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance [12:23:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T407997)', diff saved to https://phabricator.wikimedia.org/P85083 and previous config saved to /var/cache/conftool/dbconfig/20251107-122347-marostegui.json [12:29:10] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7577/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203003 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [12:30:31] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7576/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1203003 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [12:43:47] jmm@cumin2002 decommission (PID 461226) is awaiting input [12:44:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T407997)', diff saved to https://phabricator.wikimedia.org/P85084 and previous config saved to /var/cache/conftool/dbconfig/20251107-124415-marostegui.json [12:44:20] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:45:48] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps-test2006.codfw.wmnet [12:49:26] jmm@cumin2002 decommission (PID 461226) is awaiting input [12:50:48] (03CR) 10Filippo Giunchedi: [C:03+1] interface: Add wrapper for specifying the MTU [puppet] - 10https://gerrit.wikimedia.org/r/1203000 (owner: 10Majavah) [12:54:18] (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloud_private_subnet: Support enabling jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1203003 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [12:55:12] (03CR) 10Majavah: [V:03+1 C:03+2] interface: Add wrapper for specifying the MTU [puppet] - 10https://gerrit.wikimedia.org/r/1203000 (owner: 10Majavah) [12:58:20] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:58:38] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloud_private_subnet: Support enabling jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/1203003 (https://phabricator.wikimedia.org/T330075) (owner: 
10Majavah) [12:59:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P85085 and previous config saved to /var/cache/conftool/dbconfig/20251107-125923-marostegui.json [13:01:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:03:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [13:04:15] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: Rename confusingly named iface variable [puppet] - 10https://gerrit.wikimedia.org/r/1203009 [13:04:46] jmm@cumin2002 decommission (PID 461226) is awaiting input [13:05:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps-test2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:05:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:05:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps-test2006.codfw.wmnet [13:05:33] 06SRE, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11353310 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps-test2006.codfw.wmnet` - maps-test2006.codfw.... [13:07:03] (03CR) 10Gkyziridis: [C:03+1] "THNX for deploying!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1202908 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [13:08:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [13:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:14] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update revertrisk-wikidata isvc in experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202908 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [13:10:31] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7578/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1203009 (owner: 10Majavah) [13:10:55] (03Merged) 10jenkins-bot: ml-services: update revertrisk-wikidata isvc in experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202908 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [13:12:34] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:14:10] (03PS1) 10Muehlenhoff: Add temporary trixie variant of db.cfg for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1203011 (https://phabricator.wikimedia.org/T408777) [13:14:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P85086 and previous config saved to /var/cache/conftool/dbconfig/20251107-131431-marostegui.json [13:16:30] (03CR) 10CI reject: [V:04-1] Add temporary trixie variant of db.cfg for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1203011 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [13:17:10] (03PS1) 10Muehlenhoff: Remove Puppet references to maps-test200[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/1203012 (https://phabricator.wikimedia.org/T409529) [13:18:12] 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11353363 (10MoritzMuehlenhoff) [13:22:44] (03PS2) 10Muehlenhoff: Add temporary trixie variant of db.cfg for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1203011 (https://phabricator.wikimedia.org/T408777) [13:23:26] (03CR) 10Muehlenhoff: [C:03+2] Remove Puppet references to maps-test200[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/1203012 (https://phabricator.wikimedia.org/T409529) (owner: 10Muehlenhoff) [13:25:00] 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11353376 (10MoritzMuehlenhoff) [13:25:02] (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloud_private_subnet: Rename confusingly named iface variable [puppet] - 10https://gerrit.wikimedia.org/r/1203009 (owner: 10Majavah) [13:25:22] 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - 
https://phabricator.wikimedia.org/T409529#11353377 (10MoritzMuehlenhoff) [13:25:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:25:35] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission maps-test2002/maps-test2003/maps-test2004/maps-test2005/maps-test2006 - https://phabricator.wikimedia.org/T409529#11353378 (10MoritzMuehlenhoff) [13:27:27] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloud_private_subnet: Rename confusingly named iface variable [puppet] - 10https://gerrit.wikimedia.org/r/1203009 (owner: 10Majavah) [13:27:43] 10SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312#11353388 (10elukey) I had to use `sum without(recorder)` since the backfill process for edit-check caused another label to be added, ending up in errors while evaluating the `group_left()` (many-to-many relation... 
[13:27:54] (03PS1) 10Muehlenhoff: preseed: Remove old maps nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203013 (https://phabricator.wikimedia.org/T381565) [13:28:21] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:29:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T407997)', diff saved to https://phabricator.wikimedia.org/P85087 and previous config saved to /var/cache/conftool/dbconfig/20251107-132938-marostegui.json [13:29:42] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:29:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance [13:30:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T407997)', diff saved to https://phabricator.wikimedia.org/P85088 and previous config saved to /var/cache/conftool/dbconfig/20251107-133002-marostegui.json [13:30:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11353402 (10Gehel) [13:30:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11353404 (10Gehel) [13:30:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11353406 (10Gehel) [13:31:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D 
Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11353422 (10Gehel) [13:32:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11353435 (10Gehel) [13:33:17] 07sre-alert-triage, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11353466 (10Gehel) [13:33:56] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work, 13Patch-For-Review: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11353479 (10Gehel) [13:37:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11353575 (10Gehel) [13:39:05] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [13:39:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11353600 (10Gehel) [13:46:04] !log Deploy schema change on x1 codfw master with replication T409539 [13:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:07] T409539: Add the sic_url_identifier column to the cusi_case table on WMF wikis - https://phabricator.wikimedia.org/T409539 [13:47:20] !log dpogorzelski@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: 
name=ml-serve1011.eqiad.wmnet,dc=eqiad,cluster=ml_serve,service=kubesvc [13:49:26] (03CR) 10Marostegui: [C:03+1] Add temporary trixie variant of db.cfg for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1203011 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [13:51:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T407997)', diff saved to https://phabricator.wikimedia.org/P85089 and previous config saved to /var/cache/conftool/dbconfig/20251107-135111-marostegui.json [13:51:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:53:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11353728 (10Jclark-ctr) @Btullis have you completed your side so we can close ticket? [13:54:05] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [14:06:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P85090 and previous config saved to /var/cache/conftool/dbconfig/20251107-140619-marostegui.json [14:11:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [14:11:40] (03CR) 10Elukey: [C:03+1] preseed: Remove old maps nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203013 
(https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:13:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11353776 (10Andrew) For my reference, the following will be the redundant pairs according to T401295 clouddb1013 & clouddb1017 clouddb1014 & clouddb1018 cloudd... [14:16:01] (03PS1) 10CDanis: intake-logging EventGate: store x-ja4h req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203022 [14:16:43] (03CR) 10Muehlenhoff: [C:03+2] preseed: Remove old maps nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203013 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:16:43] jouncebot: nowandnext [14:16:43] For the next 17 hour(s) and 43 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251107T0800) [14:16:44] In 17 hour(s) and 43 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251108T0800) [14:17:10] (03CR) 10Fabfur: [C:03+1] intake-logging EventGate: store x-ja4h req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203022 (owner: 10CDanis) [14:17:27] (03CR) 10Vgutierrez: [C:03+1] "header name is the right one" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203022 (owner: 10CDanis) [14:18:32] (03PS1) 10Muehlenhoff: Fix Cumin aliases for maps following removal of buster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203023 (https://phabricator.wikimedia.org/T381565) [14:19:09] (03PS1) 10Brouberol: data: only keep brouberol's SSH key tied to the Yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1203024 (https://phabricator.wikimedia.org/T345633) [14:19:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1203024 (https://phabricator.wikimedia.org/T345633) (owner: 10Brouberol) [14:20:08] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11353817 (10Jclark-ctr) Confirmed: Service Request 218364102 was successfully submitted. [14:21:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P85091 and previous config saved to /var/cache/conftool/dbconfig/20251107-142125-marostegui.json [14:23:01] (03CR) 10Muehlenhoff: [C:03+2] Add temporary trixie variant of db.cfg for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1203011 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [14:23:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11353822 (10Aklapper) >>! In T409409#11349506, @hnowlan wrote: > could you let us know what username you would like for your accou... 
[14:23:21] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11353826 (10Andrew) [14:25:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:25:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11353827 (10Jhancock.wm) [14:28:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11353839 (10Andrew) a:05Andrew→03None [14:33:03] (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Remove IPv4-only monitoring override [puppet] - 10https://gerrit.wikimedia.org/r/1202993 (owner: 10Majavah) [14:33:18] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::k8s::haproxy: Remove IPv4-only monitoring override [puppet] - 10https://gerrit.wikimedia.org/r/1202993 (owner: 10Majavah) [14:36:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T407997)', diff saved to https://phabricator.wikimedia.org/P85092 and previous config saved to /var/cache/conftool/dbconfig/20251107-143633-marostegui.json [14:36:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:36:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2188.codfw.wmnet with 
reason: Maintenance [14:36:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T407997)', diff saved to https://phabricator.wikimedia.org/P85093 and previous config saved to /var/cache/conftool/dbconfig/20251107-143657-marostegui.json [14:37:37] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203026 [14:37:38] (03CR) 10Tchanders: [C:03+1] Freeze LiquidThreads on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [14:37:54] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:38:56] (03PS1) 10Majavah: Revert "Add temporary trixie variant of db.cfg for debugging" [puppet] - 10https://gerrit.wikimedia.org/r/1203027 [14:39:44] !log andrew@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol1008-dev.eqiad.wmnet'] [14:40:14] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:41:05] (03CR) 10Majavah: [C:03+2] Revert "Add temporary trixie variant of db.cfg for debugging" [puppet] - 10https://gerrit.wikimedia.org/r/1203027 (owner: 10Majavah) [14:41:47] (03PS1) 10Tchanders: Freeze LiquidThreads on enwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) [14:42:08] (03CR) 10Tchanders: [C:04-2] "Not ready yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203028 (https://phabricator.wikimedia.org/T406717) (owner: 10Tchanders) [14:42:40] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1008-dev.eqiad.wmnet'] [14:42:53] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1008-dev.eqiad.wmnet'] [14:44:00] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [14:44:20] (03CR) 10Elukey: [C:03+1] Fix Cumin aliases for maps following removal of buster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203023 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:45:45] (03PS1) 10Muehlenhoff: Add temporary trixie variant of db.cfg for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1203030 (https://phabricator.wikimedia.org/T408777) [14:46:41] (03CR) 10Brouberol: [C:03+2] data: only keep brouberol's SSH key tied to the Yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1203024 (https://phabricator.wikimedia.org/T345633) (owner: 10Brouberol) [14:46:48] (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: enabled jobs properly labeled to egress to the urldownloader proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202187 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [14:47:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11353915 
(10Marostegui) @Andrew I believe you also have to do the puppet patches. [14:48:17] (03CR) 10Muehlenhoff: [C:03+2] Add temporary trixie variant of db.cfg for debugging [puppet] - 10https://gerrit.wikimedia.org/r/1203030 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [14:49:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:49:28] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:49:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:52:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11353920 (10Jclark-ctr) @Marostegui @wiki_willy @RobH @Jhancock.wm FYI, this server was shipped with no 1G RJ45 ports — only an onboard daughter card with SFP ports. To avoid delay... [14:54:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T407997)', diff saved to https://phabricator.wikimedia.org/P85095 and previous config saved to /var/cache/conftool/dbconfig/20251107-145434-marostegui.json [14:54:38] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:55:24] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host db1264.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:58:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11353954 (10Marostegui) [14:59:33] !log cdanis@cumin1003 START - Cookbook sre.hosts.reimage for host tcp-proxy3001.esams.wmnet with OS trixie [14:59:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [15:00:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: 
Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11353961 (10Marostegui) >>! In T407897#11353919, @Jclark-ctr wrote: > @Marostegui @wiki_willy @RobH @Jhancock.wm FYI, this server was shipped with no 1G RJ45 ports — only an onboar... [15:00:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [15:02:12] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1264.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:02:17] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [15:04:50] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [15:04:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:05:10] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [15:07:25] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:08:21] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11353979 (10Jclark-ctr) >>! In T407897#11353961, @Marostegui wrote: >>>! In T407897#11353919, @Jclark-ctr wrote: >> @Marostegui @wiki_willy @RobH @Jhancock.wm FYI, this server was... 
[15:09:20] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host db1264.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:09:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P85096 and previous config saved to /var/cache/conftool/dbconfig/20251107-150941-marostegui.json [15:10:23] !log cdanis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host tcp-proxy3001.esams.wmnet with OS trixie [15:11:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11353986 (10Marostegui) >>! In T407897#11353979, @Jclark-ctr wrote: >>>! In T407897#11353961, @Marostegui wrote: >>>>! In T407897#11353919, @Jclark-ctr wrote: >>> @Marostegui @wiki_w... [15:12:38] !log cdanis@cumin1003 START - Cookbook sre.hosts.reimage for host tcp-proxy3001.esams.wmnet with OS trixie [15:13:21] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:15:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11353991 (10Jclark-ctr) [15:19:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [15:21:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354036 (10Jclark-ctr) >>! 
In T407897#11353986, @Marostegui wrote: >>>! In T407897#11353979, @Jclark-ctr wrote: >>>>! In T407897#11353961, @Marostegui wrote: >>>>>! In T407897#113539... [15:23:19] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [15:23:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [15:24:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P85097 and previous config saved to /var/cache/conftool/dbconfig/20251107-152449-marostegui.json [15:26:20] jclark@cumin1003 provision (PID 995897) is awaiting input [15:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:28:00] !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add x1/x4 wiki replicas cloudlb addresses - taavi@cumin1003" [15:28:05] !log taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add x1/x4 wiki replicas cloudlb addresses - taavi@cumin1003" [15:28:05] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354070 (10RobH) >>! In T407897#11353979, @Jclark-ctr wrote: >>>! In T407897#11353961, @Marostegui wrote: >>>>! In T407897#11353919, @Jclark-ctr wrote: >>> @Marostegui @wiki_willy... 
[15:31:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354084 (10Jclark-ctr) {F70003484} [15:32:04] (03PS1) 10Majavah: hieradata: cloudlb: Add x1/x4 sections to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) [15:33:08] !log dpogorzelski@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=ml-serve1012.eqiad.wmnet,dc=eqiad,cluster=ml_serve,service=kubesvc [15:33:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:29] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7579/console" [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [15:34:33] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7580/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [15:37:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354110 (10RobH) I mis-parsed this entire thread, apologies! The complaint isn't that the host is missing a 10G NIC, it is that it has no 1G NIC. This is expected, as not all custom builds... [15:38:17] (03CR) 10Xcollazo: "(Let's wait on merging this till at least Monday. 
There are a couple issues we need to look at.)" [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [15:38:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1264.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:39:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T407997)', diff saved to https://phabricator.wikimedia.org/P85098 and previous config saved to /var/cache/conftool/dbconfig/20251107-153957-marostegui.json [15:40:01] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:40:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance [15:41:09] (03CR) 10Majavah: [V:03+1] "Two things I'm not sure about yet:" [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [15:44:11] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1132.eqiad.wmnet with reason: C/D Migration [15:45:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11354155 (10Solenne_Lazare_WMDE) Approved [15:46:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie [15:47:33] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc-gp1005.eqiad.wmnet with reason: C/D Migration [15:48:19] !log cdanis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy3001.esams.wmnet with reason: host reimage [15:49:02] (03CR) 10Muehlenhoff: [C:03+2] Fix Cumin aliases for maps following removal of 
buster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203023 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:49:10] (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Add x1/x4 sections to wiki replicas (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [15:49:20] (03CR) 10FNegri: [C:04-1] hieradata: cloudlb: Add x1/x4 sections to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [15:49:57] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1242.eqiad.wmnet with reason: C/D Migration [15:51:07] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1243.eqiad.wmnet with reason: C/D Migration [15:51:15] (03PS2) 10Majavah: hieradata: cloudlb: Add x1/x4 sections to wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) [15:51:44] (03CR) 10Majavah: hieradata: cloudlb: Add x1/x4 sections to wiki replicas (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [15:52:44] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1057.eqiad.wmnet with reason: C/D Migration [15:52:50] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy3001.esams.wmnet with reason: host reimage [15:52:52] !log eqiad C2 switch migrations in progress [15:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:04] !log eqiad C3 switch migrations in progress [15:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1180.eqiad.wmnet with reason: C/D Migration [15:55:57] 
!log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance [15:56:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2212 (T407997)', diff saved to https://phabricator.wikimedia.org/P85099 and previous config saved to /var/cache/conftool/dbconfig/20251107-155605-marostegui.json [15:56:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:56:13] (03CR) 10FNegri: "> I'm not 100% sure if this can go in before the new sections have the host data present in etcd (via conftool)." [puppet] - 10https://gerrit.wikimedia.org/r/1203042 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah) [15:57:21] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1166.eqiad.wmnet with reason: C/D Migration [15:57:31] (03CR) 10Dzahn: [C:03+2] allocate codfw VIP for load-balanced tcp-proxy service [dns] - 10https://gerrit.wikimedia.org/r/1202835 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [15:57:37] !log dzahn@dns1004 START - running authdns-update [15:57:41] andrew@cumin2002 reimage (PID 511315) is awaiting input [15:57:49] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1167.eqiad.wmnet with reason: C/D Migration [15:58:37] !log dzahn@dns1004 END - running authdns-update [15:58:38] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1230.eqiad.wmnet with reason: C/D Migration [15:59:02] (03PS1) 10DCausse: cirrus: start A/B test on completion with default_sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203046 (https://phabricator.wikimedia.org/T404858) [15:59:10] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [15:59:57] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-conf1006.eqiad.wmnet 
with reason: C/D Migration [16:01:57] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:02:13] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [16:03:14] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1252.eqiad.wmnet with reason: C/D Migration [16:03:48] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1150.eqiad.wmnet with reason: C/D Migration [16:04:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:51] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [16:05:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11354241 (10Arian_Bozorg) @taavi yes, that's the right one! [16:06:16] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dbproxy1029.eqiad.wmnet with reason: C/D Migration [16:08:25] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on krb1002.eqiad.wmnet with reason: C/D Migration [16:08:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11354256 (10Dzahn) >>! In T409409#11349506, @hnowlan wrote: > what username you would like for your account? Usually we'd go with... 
[16:09:50] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy3001.esams.wmnet with OS trixie [16:10:17] !log eqiad c3 network migrations complete for today, moving onto next rack [16:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T407997)', diff saved to https://phabricator.wikimedia.org/P85101 and previous config saved to /var/cache/conftool/dbconfig/20251107-161455-marostegui.json [16:14:59] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:17:31] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1244.eqiad.wmnet with reason: C/D Migration [16:18:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1245.eqiad.wmnet with reason: C/D Migration [16:19:54] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1231.eqiad.wmnet with reason: C/D Migration [16:20:14] !log eqiad c/d migration now working rack c6 [16:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:44] !log cdanis@cumin1003 START - Cookbook sre.hosts.reimage for host tcp-proxy3001.esams.wmnet with OS trixie [16:21:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on pki1002.eqiad.wmnet with reason: C/D Migration [16:22:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354307 (10Marostegui) From our side both 1G or 10G is fine, whatever works best for you all [16:23:04] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dbproxy1024.eqiad.wmnet with reason: C/D Migration [16:24:03] !log robh@cumin2002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1219.eqiad.wmnet with reason: C/D Migration [16:25:11] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1220.eqiad.wmnet with reason: C/D Migration [16:26:11] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1170.eqiad.wmnet with reason: C/D Migration [16:27:34] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1171.eqiad.wmnet with reason: C/D Migration [16:28:21] FIRING: JobUnavailable: Reduced availability for job tcp_proxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:28:39] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mwlog1002.eqiad.wmnet with reason: C/D Migration [16:30:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P85102 and previous config saved to /var/cache/conftool/dbconfig/20251107-163003-marostegui.json [16:32:17] !log eqiad row C migrations complete for today, moving onto row D, D1 to start [16:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:19] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase1033.eqiad.wmnet with reason: C/D Migration [16:33:53] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on aqs1014.eqiad.wmnet with reason: C/D Migration [16:34:44] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [16:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:36:57] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm [16:37:39] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on rdb1012.eqiad.wmnet with reason: C/D Migration [16:39:15] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1051.eqiad.wmnet with reason: C/D Migration [16:39:43] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1263.eqiad.wmnet with reason: C/D Migration [16:41:47] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1253.eqiad.wmnet with reason: C/D Migration [16:42:31] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1172.eqiad.wmnet with reason: C/D Migration [16:43:21] RESOLVED: JobUnavailable: Reduced availability for job tcp_proxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:43:41] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1182.eqiad.wmnet with reason: C/D Migration [16:44:54] !log cdanis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy3001.esams.wmnet with reason: host reimage [16:45:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P85103 and previous config saved to /var/cache/conftool/dbconfig/20251107-164510-marostegui.json [16:45:27] 10SRE-SLO: Sloth: adapt default month view to quarter view - 
https://phabricator.wikimedia.org/T409312#11354398 (10elukey) Me and @tappof spent quite a bit of time today trying to debug the above problem, namely that the graph showed only some days in September and nothing more. The issue seemed the `sum_over_ti... [16:46:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on bast1003.wikimedia.org with reason: C/D Migration [16:46:50] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences, 06Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11354405 (10Jdrewniak) 05Open→03Resolved [16:47:43] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy3001.esams.wmnet with reason: host reimage [16:49:14] (03PS1) 10CDanis: autoinstall: routed Ganeti: fix ipv6 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203052 (https://phabricator.wikimedia.org/T408064) [16:49:17] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cp1112.eqiad.wmnet with reason: C/D Migration [16:49:43] (03CR) 10CI reject: [V:04-1] autoinstall: routed Ganeti: fix ipv6 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203052 (https://phabricator.wikimedia.org/T408064) (owner: 10CDanis) [16:50:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:50:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:52:36] (03PS2) 10CDanis: autoinstall: routed Ganeti: fix ipv6 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203052 (https://phabricator.wikimedia.org/T408064) [16:56:01] !log cdanis@cumin1003 START - Cookbook sre.hosts.reimage for host tcp-proxy3002.esams.wmnet with OS trixie [16:56:37] !log cdanis@cumin1003 START - Cookbook sre.hosts.reimage for host tcp-proxy7001.magru.wmnet 
with OS trixie [16:57:23] !log cdanis@cumin1003 START - Cookbook sre.hosts.reimage for host tcp-proxy7002.magru.wmnet with OS trixie [16:58:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:58:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:58:48] (03CR) 10Dwisehaupt: [C:03+1] "Looks good. +1'ing so someone on the prod side and +2 and merge it." [puppet] - 10https://gerrit.wikimedia.org/r/1202827 (https://phabricator.wikimedia.org/T367370) (owner: 10Jgreen) [17:00:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T407997)', diff saved to https://phabricator.wikimedia.org/P85104 and previous config saved to /var/cache/conftool/dbconfig/20251107-170018-marostegui.json [17:00:22] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:00:25] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [17:00:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance [17:00:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T407997)', diff saved to https://phabricator.wikimedia.org/P85105 and previous config saved to /var/cache/conftool/dbconfig/20251107-170042-marostegui.json [17:00:51] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [17:00:55] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [17:00:58] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1175.eqiad.wmnet with reason: C/D Migration [17:01:25] (03CR) 10Ladsgroup: "Notified the community: 
https://nl.wikipedia.org/wiki/Wikipedia:De_kroeg#c-ASarabadani_(WMF)-20251107165900-Changing_default_size_of_thumb" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup) [17:02:58] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy3001.esams.wmnet with OS trixie [17:03:21] FIRING: JobUnavailable: Reduced availability for job tcp_proxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:03:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-be1091.eqiad.wmnet with reason: C/D Migration [17:03:53] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2002-dev.codfw.wmnet with OS trixie [17:04:51] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on maps1014.eqiad.wmnet with reason: C/D Migration [17:05:28] !log eqiad d2 migrations in progress [17:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:28] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on titan1002.eqiad.wmnet with reason: C/D Migration [17:06:51] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wdqs1022.eqiad.wmnet with reason: C/D Migration [17:07:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:08:20] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1209.eqiad.wmnet with reason: C/D Migration [17:08:21] FIRING: [2x] JobUnavailable: Reduced 
availability for job tcp_proxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:34] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-fe1020.eqiad.wmnet with reason: C/D Migration [17:10:42] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on druid1013.eqiad.wmnet with reason: C/D Migration [17:11:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1203052 (https://phabricator.wikimedia.org/T408064) (owner: 10CDanis) [17:11:16] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11354501 (10Andrew) I've just noticed that there are quite a few 2-drive r450s that reimaged without trouble, for example cloudrabbit200[234]-dev. 
[17:13:18] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-presto1020.eqiad.wmnet with reason: C/D Migration [17:13:21] FIRING: [2x] JobUnavailable: Reduced availability for job tcp_proxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:13:56] (03PS1) 10Giuseppe Lavagetto: cache/haproxy: set x-trusted-request to D for UA-compliant robots [puppet] - 10https://gerrit.wikimedia.org/r/1203054 (https://phabricator.wikimedia.org/T406545) [17:13:58] (03PS1) 10Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) [17:14:41] (03CR) 10CI reject: [V:04-1] cache::text: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [17:14:41] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1227.eqiad.wmnet with reason: C/D Migration [17:15:33] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1228.eqiad.wmnet with reason: C/D Migration [17:16:53] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-fe1013.eqiad.wmnet with reason: C/D Migration [17:18:02] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on relforge1009.eqiad.wmnet with reason: C/D Migration [17:19:17] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wdqs1017.eqiad.wmnet with reason: C/D Migration [17:19:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T407997)', diff saved to https://phabricator.wikimedia.org/P85106 and previous 
config saved to /var/cache/conftool/dbconfig/20251107-171931-marostegui.json [17:19:35] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:21:31] !log eqiad d2 network migrations done for today, moving onto d3 [17:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:54] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [17:21:58] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudlb2002-dev.codfw.wmnet with reason: host reimage [17:22:11] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1247.eqiad.wmnet with reason: C/D Migration [17:23:22] !log cdanis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy3002.esams.wmnet with reason: host reimage [17:24:48] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on kubestage1004.eqiad.wmnet with reason: C/D Migration [17:25:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:26:21] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1135.eqiad.wmnet with reason: C/D Migration [17:28:00] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy3002.esams.wmnet with reason: host reimage [17:28:29] !log cdanis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy7001.magru.wmnet with reason: host reimage [17:29:01] !log cdanis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
tcp-proxy7002.magru.wmnet with reason: host reimage [17:29:37] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1136.eqiad.wmnet with reason: C/D Migration [17:30:19] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1137.eqiad.wmnet with reason: C/D Migration [17:31:24] PROBLEM - haproxy process on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [17:31:39] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:31:52] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy7001.magru.wmnet with reason: host reimage [17:34:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P85107 and previous config saved to /var/cache/conftool/dbconfig/20251107-173439-marostegui.json [17:37:42] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy7002.magru.wmnet with reason: host reimage [17:37:55] PROBLEM - Host cloudlb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [17:39:16] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1138.eqiad.wmnet with reason: C/D Migration [17:39:27] RECOVERY - Host cloudlb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [17:40:03] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sun 23 Nov 2025 05:40:00 PM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [17:40:25] PROBLEM - haproxy process on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [17:40:25] PROBLEM - Bird Internet Routing Daemon on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:40:26] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1134.eqiad.wmnet with reason: C/D Migration [17:41:57] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cirrussearch1103.eqiad.wmnet with reason: C/D Migration [17:43:21] RESOLVED: JobUnavailable: Reduced availability for job tcp_proxy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:43:41] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1229.eqiad.wmnet with reason: C/D Migration [17:44:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1230.eqiad.wmnet with reason: C/D Migration [17:45:58] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1051.eqiad.wmnet with reason: C/D Migration [17:46:24] RECOVERY - haproxy process on cloudlb2002-dev is OK: PROCS OK: 3 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [17:47:27] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1052.eqiad.wmnet with reason: C/D Migration [17:48:06] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy3002.esams.wmnet with OS trixie [17:49:11] !log robh@cumin2002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 0:10:00 on thanos-fe1007.eqiad.wmnet with reason: C/D Migration [17:49:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P85108 and previous config saved to /var/cache/conftool/dbconfig/20251107-174946-marostegui.json [17:49:56] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy7001.magru.wmnet with OS trixie [17:51:22] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cp1113.eqiad.wmnet with reason: C/D Migration [17:52:25] !log cdanis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy7002.magru.wmnet with OS trixie [17:53:03] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1053.eqiad.wmnet with reason: C/D Migration [17:53:49] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on mc1054.eqiad.wmnet with reason: C/D Migration [17:55:25] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-be1067.eqiad.wmnet with reason: C/D Migration [17:57:04] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cp1114.eqiad.wmnet with reason: C/D Migration [17:57:18] (03CR) 10CDanis: [C:03+2] autoinstall: routed Ganeti: fix ipv6 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203052 (https://phabricator.wikimedia.org/T408064) (owner: 10CDanis) [17:57:53] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on kafka-logging1003.eqiad.wmnet with reason: C/D Migration [17:59:12] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11354598 (10CDanis) On trixie, the attempt to read the v6 address from the 
qemu variables in [[ https://gerri... [17:59:46] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on moss-fe1002.eqiad.wmnet with reason: C/D Migration [18:00:07] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11354599 (10CDanis) 05Open→03Resolved [18:04:26] !log eqiad d4 migrations done for today [18:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T407997)', diff saved to https://phabricator.wikimedia.org/P85109 and previous config saved to /var/cache/conftool/dbconfig/20251107-180454-marostegui.json [18:04:58] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:07:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11354607 (10RobH) I accidentally pasted the Day 1 update on a subtask: >>! In T405945#11351182, @RobH wrote: > Day 1 of migrations update: > > * 58 hosts mo... 
[18:10:40] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-be1093.eqiad.wmnet with reason: C/D Migration [18:11:03] !log eqiad d7 network port migrations in progress [18:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:58] PROBLEM - Host cloudlb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [18:13:13] hi, I'm going to do a small MediaWiki config deploy [18:14:26] RECOVERY - Host cloudlb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [18:15:25] PROBLEM - haproxy alive on cloudlb2002-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [18:15:25] PROBLEM - Bird Internet Routing Daemon on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:15:25] PROBLEM - haproxy process on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [18:16:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203022 (owner: 10CDanis) [18:16:09] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1231.eqiad.wmnet with reason: C/D Migration [18:16:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1231.eqiad.wmnet with reason: C/D Migration [18:16:59] (03Merged) 10jenkins-bot: intake-logging EventGate: store x-ja4h req hdr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203022 (owner: 10CDanis) [18:17:19] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1203022|intake-logging EventGate: store x-ja4h req hdr]] [18:17:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1231.eqiad.wmnet with reason: C/D
Migration [18:19:28] !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1203022|intake-logging EventGate: store x-ja4h req hdr]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:20:02] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on an-worker1152.eqiad.wmnet with reason: C/D Migration [18:20:04] !log cdanis@deploy2002 cdanis: Continuing with sync [18:20:38] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on prometheus1007.eqiad.wmnet with reason: C/D Migration [18:22:41] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dbprov1004.eqiad.wmnet with reason: C/D Migration [18:23:25] RECOVERY - haproxy process on cloudlb2002-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [18:24:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:13] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cirrussearch1120.eqiad.wmnet with reason: C/D Migration [18:24:21] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203022|intake-logging EventGate: store x-ja4h req hdr]] (duration: 07m 02s) [18:24:28] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: sync [18:24:37] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync [18:25:27] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: sync [18:25:31] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync [18:26:24] 
PROBLEM - haproxy alive on cloudlb2002-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [18:26:56] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cirrussearch1121.eqiad.wmnet with reason: C/D Migration [18:28:24] PROBLEM - haproxy process on cloudlb2002-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [18:32:42] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cirrussearch1122.eqiad.wmnet with reason: C/D Migration [18:35:26] RECOVERY - haproxy process on cloudlb2002-dev is OK: PROCS OK: 1 process with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [18:36:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11354663 (10Dzahn) Thanks a lot for figuring this out and fixing it! [18:36:49] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [18:37:08] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [18:38:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11354665 (10RobH) Day 2 update: * 73 servers moved today, 169 servers remain. * We (again) focused on moving hosts that did not require any specific scheduli... 
[18:38:26] PROBLEM - haproxy alive on cloudlb2002-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [18:39:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:40:21] !log eqiad c/d migration work complete for today [18:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11354723 (10Jclark-ctr) @cmooney Few things we ran into an-worker1136 Failed to ping after migration. changed cable port old and new showed link moved ba... [18:50:16] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host db1264.eqiad.wmnet with OS bookworm [18:59:53] (03PS1) 10Andrew Bogott: cloud haproxy config: replace httpchk with http-check send [puppet] - 10https://gerrit.wikimedia.org/r/1203076 (https://phabricator.wikimedia.org/T409580) [19:01:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1203076 (https://phabricator.wikimedia.org/T409580) (owner: 10Andrew Bogott) [19:07:21] (03CR) 10Majavah: [C:04-1] "I'm fairly sure you still need the plain `option httpchk` option to enable HTTP checking mode instead of the default TCP, in addition to t" [puppet] - 10https://gerrit.wikimedia.org/r/1203076 (https://phabricator.wikimedia.org/T409580) (owner: 10Andrew Bogott) [19:08:34] jclark@cumin1003 reimage (PID 1040702) is awaiting input [19:11:49] (03PS2) 10Andrew Bogott: cloud haproxy config: replace httpchk with http-check send [puppet] - 10https://gerrit.wikimedia.org/r/1203076 (https://phabricator.wikimedia.org/T409580) 
[19:12:59] (03CR) 10Andrew Bogott: "yep, that's what logfile says too :)" [puppet] - 10https://gerrit.wikimedia.org/r/1203076 (https://phabricator.wikimedia.org/T409580) (owner: 10Andrew Bogott) [19:13:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1203076 (https://phabricator.wikimedia.org/T409580) (owner: 10Andrew Bogott) [19:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:15:27] RECOVERY - haproxy alive on cloudlb2002-dev is OK: OK check_alive uptime 301s https://wikitech.wikimedia.org/wiki/HAProxy [19:16:56] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1264.eqiad.wmnet with OS bookworm [19:17:23] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host db1264.eqiad.wmnet with OS bookworm [19:24:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission maps1005/maps1006/maps1007/maps1008/maps1009/map1010 - https://phabricator.wikimedia.org/T409399#11354808 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [19:24:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:25:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1003 - https://phabricator.wikimedia.org/T409524#11354812 (10Jclark-ctr) [19:25:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1003 - https://phabricator.wikimedia.org/T409524#11354813 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [19:27:23] FIRING: [2x] SwitchCoreInterfaceDown: Switch core 
interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:32:21] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1264.eqiad.wmnet with reason: host reimage [19:38:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1264.eqiad.wmnet with reason: host reimage [19:43:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11354869 (10Jclark-ctr) a:05cmooney→03Jclark-ctr [19:55:08] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:56:28] RECOVERY - Bird Internet Routing Daemon on cloudlb2002-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:56:56] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:57:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:57:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1264.eqiad.wmnet with OS bookworm [19:57:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2002-dev (172.20.5.3) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:58:04] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudlb2002-dev.codfw.wmnet with OS trixie [19:59:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354897 (10Jclark-ctr) a:05Marostegui→03Jclark-ctr [20:01:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354903 (10Jclark-ctr) Server did finish imaging and passed. Forgot to put ticket number in cookbook ` •logmsgbot> !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netb... [20:01:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11354904 (10Jclark-ctr) 05Open→03Resolved [20:11:11] (03CR) 10Andrew Bogott: [C:03+2] cloud haproxy config: replace httpchk with http-check send [puppet] - 10https://gerrit.wikimedia.org/r/1203076 (https://phabricator.wikimedia.org/T409580) (owner: 10Andrew Bogott) [20:16:11] (03PS1) 10Catrope: i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery [extensions/WikimediaMessages] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203097 (https://phabricator.wikimedia.org/T399749) [20:16:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203097 (https://phabricator.wikimedia.org/T399749) (owner: 10Catrope) [20:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:36:54] (03PS1) 10Catrope: OATHManage: Don't always set the page title to "Create new recovery codes" 
[extensions/OATHAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203126 [20:37:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/OATHAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203126 (owner: 10Catrope) [20:45:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:54:42] (03PS4) 10Pmiazga: api-gateway: Make x-ratelimit response header configurable. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) [20:57:48] (03PS5) 10Pmiazga: api-geteway: rename symbols used in restgw ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) [21:00:00] (03CR) 10Pmiazga: api-geteway: rename symbols used in restgw ratelimiter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [21:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:04:10] (03PS6) 10Pmiazga: api-geteway: rename symbols used in restgw ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) [21:05:00] (03PS5) 10Pmiazga: api-gateway: Make x-ratelimit response header configurable. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) [21:08:43] (03PS1) 10Stoyofuku-wmf: Use addModuleStyles for ReadingList icons [extensions/ReadingLists] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203139 (https://phabricator.wikimedia.org/T409116) [21:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ReadingLists] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203139 (https://phabricator.wikimedia.org/T409116) (owner: 10Stoyofuku-wmf) [21:13:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:14:07] (03CR) 10Pmiazga: "could you explain what needs to change in ratelimiter_metrics.yaml ?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [21:22:38] (03PS1) 10Arlolra: Deploy Parsoid Read Views to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) [21:38:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11355132 (10RobH) [21:54:49] (03CR) 10C. Scott Ananian: [C:03+1] Deploy Parsoid Read Views to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: 10Arlolra) [22:24:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:44] (03PS1) 10Dzahn: tcpproxy: include profile::lvs::realserver in role [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) [22:40:49] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [22:40:53] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:43:07] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2003-dev.codfw.wmnet with OS trixie [22:46:56] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:47:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2003-dev (172.20.5.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:48:26] (03PS1) 10Andrew Bogott: toolforge haproxy config: replace httpchk with http-check send [puppet] - 10https://gerrit.wikimedia.org/r/1203175 [22:49:31] (03PS2) 10Andrew Bogott: toolforge haproxy config: replace httpchk with http-check send [puppet] - 10https://gerrit.wikimedia.org/r/1203175 [22:53:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11355319 (10Andrew) a:03Andrew You're right! [22:58:12] (03CR) 10Dzahn: [C:04-1] "uninitialized constant Puppet::Pops::Loader::RubyFunctionInstantiator::Yaml :(" [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [23:00:57] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [23:01:38] (03PS2) 10Arlolra: Deploy Parsoid Read Views to 14 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) [23:04:48] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2003-dev.codfw.wmnet with reason: host reimage [23:08:55] (03CR) 10Dzahn: [C:04-1] ""indicates that the Puppet environment is missing access to the necessary YAML library. 
This commonly happens due to an issue with the Rub" [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [23:09:58] 10SRE-Access-Requests: New SSH key for Brett Cornwall - https://phabricator.wikimedia.org/T409600 (10BCornwall) 03NEW [23:10:12] (03PS1) 10BCornwall: admin: Update brett SSH key to FIDO [puppet] - 10https://gerrit.wikimedia.org/r/1203179 (https://phabricator.wikimedia.org/T409600) [23:14:05] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:15:14] (03CR) 10Dzahn: [C:04-1] "Any idea why it's getting this "uninitialized constant" error? If it was just some missing Hiera keys I would expect a different error and" [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [23:24:55] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:27:04] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2003-dev.codfw.wmnet with OS trixie [23:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:27:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudlb2003-dev (172.20.5.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown