[00:43:47] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:17:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:12:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:16:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:17:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:56:11] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:58:11] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:41:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:46:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:23:25] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 187 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:25:27] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:42:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2127.codfw.wmnet with reason: Maintenance
[05:43:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2127.codfw.wmnet with reason: Maintenance
[06:10:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:10:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:19:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[06:19:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[06:19:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:20:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:20:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T321126)', diff saved to https://phabricator.wikimedia.org/P41258 and previous config saved to /var/cache/conftool/dbconfig/20221128-062008-marostegui.json
[06:20:14] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[06:25:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321126)', diff saved to https://phabricator.wikimedia.org/P41259 and previous config saved to /var/cache/conftool/dbconfig/20221128-062516-marostegui.json
[06:25:23] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[06:34:17] (CR) Marostegui: [C: +1] mariadb: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860913 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:34:34] (CR) Marostegui: [C: +1] mariadb::proxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860911 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:36:56] SRE, ops-codfw, DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (Marostegui) MySQL is now off again, so @Papaul you can do the test whenever you can.
[06:37:40] (PS1) Marostegui: control-mysql-5.7: We won't use 5.7 [software] - https://gerrit.wikimedia.org/r/861193
[06:38:17] (CR) Marostegui: [C: +2] control-mysql-5.7: We won't use 5.7 [software] - https://gerrit.wikimedia.org/r/861193 (owner: Marostegui)
[06:38:50] (Merged) jenkins-bot: control-mysql-5.7: We won't use 5.7 [software] - https://gerrit.wikimedia.org/r/861193 (owner: Marostegui)
[06:40:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P41260 and previous config saved to /var/cache/conftool/dbconfig/20221128-064022-marostegui.json
[06:42:21] (PS2) Kosta Harlan: GrowthExperiments: Start newimpact experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526)
[06:42:40] (PS3) Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526)
[06:43:43] (PS4) Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526)
[06:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P41261 and previous config saved to /var/cache/conftool/dbconfig/20221128-065529-marostegui.json
[07:00:15] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:02:17] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:08:45] (PS1) KartikMistry: Update cxserver to 2022-11-28-053412-production [deployment-charts] - https://gerrit.wikimedia.org/r/861195 (https://phabricator.wikimedia.org/T323825)
[07:10:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321126)', diff saved to https://phabricator.wikimedia.org/P41262 and previous config saved to /var/cache/conftool/dbconfig/20221128-071035-marostegui.json
[07:10:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[07:10:42] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[07:10:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[07:10:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T321126)', diff saved to https://phabricator.wikimedia.org/P41263 and previous config saved to /var/cache/conftool/dbconfig/20221128-071057-marostegui.json
[07:13:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321126)', diff saved to https://phabricator.wikimedia.org/P41264 and previous config saved to /var/cache/conftool/dbconfig/20221128-071306-marostegui.json
[07:28:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P41265 and previous config saved to /var/cache/conftool/dbconfig/20221128-072813-marostegui.json
[07:31:01] (CR) ArielGlenn: [C: +1] dumps/distribution: add more data types to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn)
[07:31:11] (CR) Giuseppe Lavagetto: [C: +2] Remove the parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/860703 (owner: Giuseppe Lavagetto)
[07:35:37] (Merged) jenkins-bot: Remove the parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/860703 (owner: Giuseppe Lavagetto)
[07:36:41] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:37:38] (CR) Giuseppe Lavagetto: [C: +2] miscweb: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860510 (owner: Giuseppe Lavagetto)
[07:38:43] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:42:41] (Merged) jenkins-bot: miscweb: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860510 (owner: Giuseppe Lavagetto)
[07:43:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P41266 and previous config saved to /var/cache/conftool/dbconfig/20221128-074319-marostegui.json
[07:43:34] (CR) Giuseppe Lavagetto: [C: +2] recommendation-api: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860511 (owner: Giuseppe Lavagetto)
[07:47:58] (Merged) jenkins-bot: recommendation-api: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860511 (owner: Giuseppe Lavagetto)
[07:53:43] (PS2) Muehlenhoff: graphite: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860910 (https://phabricator.wikimedia.org/T308013)
[07:58:12] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/860945 (https://phabricator.wikimedia.org/T322670) (owner: Andrea Denisse)
[07:58:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321126)', diff saved to https://phabricator.wikimedia.org/P41267 and previous config saved to /var/cache/conftool/dbconfig/20221128-075826-marostegui.json
[07:58:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:58:34] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[07:58:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:58:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T321126)', diff saved to https://phabricator.wikimedia.org/P41268 and previous config saved to /var/cache/conftool/dbconfig/20221128-075847-marostegui.json
[07:59:00] (CR) Muehlenhoff: [C: +2] graphite: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860910 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:00:04] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T0800).
[08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:37] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 208 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:00:41] * kart_ is here and will self deploy..
[08:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321126)', diff saved to https://phabricator.wikimedia.org/P41269 and previous config saved to /var/cache/conftool/dbconfig/20221128-080057-marostegui.json
[08:02:39] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:02:39] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:02:40] (PS2) KartikMistry: Content Translation: Reverse MT threshold for Japanese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/860701 (https://phabricator.wikimedia.org/T323721)
[08:03:41] (CR) Jelto: [C: +2] gitlab_runner: make one Shared Runner canary [puppet] - https://gerrit.wikimedia.org/r/858188 (owner: Jelto)
[08:04:06] !log rebalance Ganeti group C/codfw following reboots
[08:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:00] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/860701 (https://phabricator.wikimedia.org/T323721) (owner: KartikMistry)
[08:06:16] (Merged) jenkins-bot: Content Translation: Reverse MT threshold for Japanese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/860701 (https://phabricator.wikimedia.org/T323721) (owner: KartikMistry)
[08:07:50] !log kartik@deploy1002 Backport cancelled.
[08:08:27] James_F: "There were unexpected commits pulled from origin for /srv/mediawiki-staging." Did you forget something to deploy?
[08:09:03] (CR) Muehlenhoff: [C: +2] Make ganeti2032 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/860873 (https://phabricator.wikimedia.org/T313856) (owner: Muehlenhoff)
[08:09:07] (PS1) TrainBranchBot: Revert "Content Translation: Reverse MT threshold for Japanese Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/861341
[08:09:09] (CR) TrainBranchBot: "kartik@deploy1002 created a revert of this change as I8f4434220bd2d53947fd2eaab55fe47d80e36f8a" [mediawiki-config] - https://gerrit.wikimedia.org/r/860701 (https://phabricator.wikimedia.org/T323721) (owner: KartikMistry)
[08:09:30] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/861341 (owner: TrainBranchBot)
[08:09:40] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: apply
[08:09:59] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply
[08:10:15] (Merged) jenkins-bot: Revert "Content Translation: Reverse MT threshold for Japanese Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/861341 (owner: TrainBranchBot)
[08:10:18] (PS3) Slyngshede: Allow multiple server connections to be defined. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/860857
[08:10:27] !log kartik@deploy1002 Started scap: Backport for [[gerrit:861341|Revert "Content Translation: Reverse MT threshold for Japanese Wikipedia"]]
[08:10:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:11:32] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply
[08:11:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:11:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:11:57] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply
[08:12:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:15:03] Not sure - but scap revert seems stuck at, `08:10:29 K8s images build/push output redirected to /home/kartik/scap-image-build-and-push-log`
[08:15:20] (CR) Slyngshede: Allow multiple server connections to be defined. (1 comment) [software/bitu-ldap] - https://gerrit.wikimedia.org/r/860857 (owner: Slyngshede)
[08:16:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P41270 and previous config saved to /var/cache/conftool/dbconfig/20221128-081603-marostegui.json
[08:16:57] !log kartik@deploy1002 kartik and trainbranchbot: Backport for [[gerrit:861341|Revert "Content Translation: Reverse MT threshold for Japanese Wikipedia"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[08:18:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:19:05] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply
[08:19:30] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply
[08:21:28] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[08:21:39] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:861341|Revert "Content Translation: Reverse MT threshold for Japanese Wikipedia"]] (duration: 11m 12s)
[08:21:44] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[08:21:52] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[08:22:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:22:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:22:20] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[08:24:58] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[08:25:25] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[08:25:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:26:02] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38449/console" [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede)
[08:30:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:31:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P41271 and previous config saved to /var/cache/conftool/dbconfig/20221128-083110-marostegui.json
[08:32:43] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:35:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:35:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:35:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
[08:35:38] (PS2) Slyngshede: WIP C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568
[08:37:13] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38450/console" [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede)
[08:37:42] (CR) CI reject: [V: -1] WIP C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede)
[08:39:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:42:15] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:43:35] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[08:43:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
[08:44:39] (PS3) Slyngshede: WIP C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568
[08:46:16] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38451/console" [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede)
[08:46:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321126)', diff saved to https://phabricator.wikimedia.org/P41272 and previous config saved to /var/cache/conftool/dbconfig/20221128-084616-marostegui.json
[08:46:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[08:46:24] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[08:46:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[08:46:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T321126)', diff saved to https://phabricator.wikimedia.org/P41273 and previous config saved to /var/cache/conftool/dbconfig/20221128-084637-marostegui.json
[08:51:43] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:55:07] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:58:19] (CR) David Caro: [C: +1] "LGTM feel free to ignore the nits" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860915 (owner: Arturo Borrero Gonzalez)
[09:00:23] SRE, SRE-OnFire, Product-Infrastructure-Team-Backlog, Maps (Kartotherian), Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (Marostegui) @lmata ping
[09:03:14] SRE, SRE-tools, Infrastructure-Foundations: Broken disk on thanos-be1003 but not reported / task not opened - https://phabricator.wikimedia.org/T285662 (Marostegui) @Volans do you want to keep this open?
[09:04:45] SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform Value Stream, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (Marostegui) @lmata what should we do with this follow up task?
[09:05:01] SRE-swift-storage, Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (Marostegui)
[09:05:37] SRE, SRE-tools, Infrastructure-Foundations: Broken disk on thanos-be1003 but not reported / task not opened - https://phabricator.wikimedia.org/T285662 (Volans) @Marostegui Good question, I'm not aware of other occurrences of the same issue, so it can probably be closed. @fgiunchedi any thoughts?
[09:06:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:06:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:06:49] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (Marostegui) @MoritzMuehlenhoff is this all done?
[09:07:23] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Marostegui) What should we do with this task? Anything left?
[09:07:57] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:08:18] SRE, Traffic, affects-Kiwix-and-openZIM: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011 (Marostegui) Open→Resolved a:Marostegui I am going to tentatively close this as fixed per T317011#8212217. Please reopen if it is not the case.
[09:08:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:08:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2032.codfw.wmnet to cluster codfw and group B
[09:09:05] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: openstack: common: allow to list servers with extra information (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860915 (owner: Arturo Borrero Gonzalez)
[09:09:08] SRE, Traffic, affects-Kiwix-and-openZIM: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011 (Marostegui) a:Marostegui→None
[09:09:40] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (Marostegui) Open→Resolved a:MoritzMuehlenhoff I am assuming {T317416} takes over, so closing this
[09:12:19] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[09:12:26] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff) Resolved→Open Actually, the openssh update is still TBD, reopening until I have completed that one.
[09:12:32] !log rebalance Ganeti group A/eqiad T311687
[09:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:39] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[09:14:13] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[09:14:57] RECOVERY - Ganeti memory on ganeti1023 is OK: OK Memory 73% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[09:15:17] (CR) FNegri: harbor: ensure that it's started (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: David Caro)
[09:15:46] (PS4) Slyngshede: WIP C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568
[09:16:21] (Abandoned) Awight: Send PostgreSQL logs to logstash [puppet] - https://gerrit.wikimedia.org/r/853941 (https://phabricator.wikimedia.org/T321887) (owner: Awight)
[09:16:33] SRE, SRE-tools, Infrastructure-Foundations: Broken disk on thanos-be1003 but not reported / task not opened - https://phabricator.wikimedia.org/T285662 (fgiunchedi) Open→Resolved a:fgiunchedi Agreed, I'm not aware of further occurrences. I'll be BOLD and resolve the task, thank you!
[09:18:25] SRE, ops-ulsfo, Infrastructure-Foundations: Degraded RAID on ganeti4006 - https://phabricator.wikimedia.org/T321863 (Marostegui) Open→Resolved a:Marostegui The RAID is actually ok ` Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sd...
[09:18:44] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) As soon as I finish the wikitech description I intend to resolve it.
[09:18:51] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38452/console" [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede) [09:19:16] SRE, ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (Marostegui) a:Papaul [09:20:18] SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returning 500's - https://phabricator.wikimedia.org/T323448 (Marostegui) Open→Resolved Works for me too now. Going to resolve it for now.Please reopen if you run into this again. Thanks for reporting [09:20:30] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::designate: remove separate profile for firewall rules [puppet] - https://gerrit.wikimedia.org/r/854539 (owner: Majavah) [09:22:58] (CR) Filippo Giunchedi: [C: +2] graphite: mirror traffic to graphite1005 [puppet] - https://gerrit.wikimedia.org/r/860521 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [09:23:03] (PS2) Filippo Giunchedi: graphite: mirror traffic to graphite1005 [puppet] - https://gerrit.wikimedia.org/r/860521 (https://phabricator.wikimedia.org/T318903) [09:26:27] (PS1) Ilias Sarantopoulos: enalbe multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) [09:27:44] (CR) Filippo Giunchedi: [V: +2 C: +2] New upstream release [debs/thanos] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/860846 (https://phabricator.wikimedia.org/T303154) (owner: Filippo Giunchedi) [09:29:46] (CR) Filippo Giunchedi: [C: +1] P:mediawiki::maintenance: CampaignEvents periodic [puppet] - https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: Clément Goubert) [09:30:31] (CR) Clément Goubert: [C: +2] P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 
https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: Clément Goubert) [09:31:15] (CR) Arturo Borrero Gonzalez: "I'm a bit confused about this." [puppet] - https://gerrit.wikimedia.org/r/854875 (owner: Majavah) [09:34:16] (CR) David Caro: [C: +1] "LGTM, feel free to ignore the nits" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860924 (owner: Arturo Borrero Gonzalez) [09:35:21] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 212 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:36:25] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:37:45] (CR) Filippo Giunchedi: [C: +2] "Going ahead, let me know feedback post-review too" [alerts] - https://gerrit.wikimedia.org/r/860609 (owner: Filippo Giunchedi) [09:38:03] (CR) David Caro: harbor: ensure that it's started (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: David Caro) [09:40:24] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) [09:40:36] SRE, Traffic: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (jcrespo) [09:40:40] (CR) Vgutierrez: [C: +1] hiera: unify ulsfo LVS configuration [puppet] - https://gerrit.wikimedia.org/r/860930 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh) [09:40:44] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: openstack: 
inventory: add support to network information (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860924 (owner: Arturo Borrero Gonzalez) [09:40:52] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) Open→Resolved a:Vgutierrez @hashar @Vgutierrez Please review my summary of the incident at: https://wikitech.wikimedia.org/wik... [09:41:00] SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform Value Stream, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (BTullis) I'm not sure that there's much more to do, is there? Fro... [09:41:12] (CR) Hashar: [C: +1] "Valentin and I clarified this is the first phase, the next step will be to remove port 80 / plain HTTP entirely later on." [puppet] - https://gerrit.wikimedia.org/r/859986 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez) [09:41:36] (CR) Vgutierrez: [C: +2] gerrit: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859986 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez) [09:43:11] (CR) Ayounsi: Add function to int_automation to validate QFX5120 port blocks (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney) [09:43:18] (PS5) Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568 [09:43:28] (CR) Muehlenhoff: [C: +2] Add role_contacts for mwlog [puppet] - https://gerrit.wikimedia.org/r/860886 (owner: Muehlenhoff) [09:44:18] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for envoyproxy on Grafana [puppet] - https://gerrit.wikimedia.org/r/860576 (https://phabricator.wikimedia.org/T135991) 
(owner: Muehlenhoff) [09:45:05] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:45:28] SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) [09:45:34] (PS2) Muehlenhoff: zookeeper: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860907 (https://phabricator.wikimedia.org/T308013) [09:46:31] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 11 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:46:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T321126)', diff saved to https://phabricator.wikimedia.org/P41274 and previous config saved to /var/cache/conftool/dbconfig/20221128-094654-marostegui.json [09:47:01] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [09:48:03] SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) [09:48:45] SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez) [09:51:52] SRE, ops-ulsfo, Infrastructure-Foundations: Degraded RAID on ganeti4006 - https://phabricator.wikimedia.org/T321863 (MoritzMuehlenhoff) This was some alert spam during initial setup; this is one of the new servers in ulsfo.
[09:53:13] SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (hashar) For Gerrit, I have made the announcement on [[ https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/WGIDWKB4YN3DM7K... [09:55:54] SRE, Infrastructure-Foundations: Provide an option menu when booting via PXE - https://phabricator.wikimedia.org/T191018 (LSobanski) Clinic Duty drive-by tagging. [09:56:19] SRE, Infrastructure-Foundations: Provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135 (LSobanski) Clinic Duty drive-by tagging. [09:56:57] SRE, DC-Ops, Tracking-Neverending: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063 (LSobanski) Clinic Duty drive-by tagging. [09:57:08] (CR) David Caro: [C: +1] wmcs: openstack: inventory: add support to network information (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/860924 (owner: Arturo Borrero Gonzalez) [10:02:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P41275 and previous config saved to /var/cache/conftool/dbconfig/20221128-100200-marostegui.json [10:07:59] (PS1) Elukey: knative: import new upstream version 1.7.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) [10:09:31] (PS3) David Caro: harbor: remove support for (PS3) David Caro: harbor: remove unused harbor::db module/role [puppet] - https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) [10:09:35] (PS8) David Caro: toolforge harbor: update certs with acmechief [puppet] - https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: Bstorm) [10:09:37] (PS2) David Caro: harbor: ensure that it's started 
[puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) [10:14:37] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:05] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P41276 and previous config saved to /var/cache/conftool/dbconfig/20221128-101706-marostegui.json [10:20:28] (PS14) Clément Goubert: opentelemetry-collector: Basic install [puppet] - https://gerrit.wikimedia.org/r/856931 [10:20:52] (PS1) Arturo Borrero Gonzalez: cloudvirt1043: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/861350 (https://phabricator.wikimedia.org/T319184) [10:21:49] (CR) Cathal Mooney: [C: +1] cloudvirt1043: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/861350 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez) [10:22:38] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38453/console" [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:23:51] (CR) Clément Goubert: opentelemetry-collector: Basic install [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:24:17] (CR) Clément Goubert: [V: +1] "PCC OK, see above." 
[puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:28:35] (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/861351 [10:29:55] (PS1) JMeybohm: Rewrite as kubernetes operator/controller [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) [10:29:57] (PS1) JMeybohm: update vendor [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861353 (https://phabricator.wikimedia.org/T323706) [10:30:13] (CR) Jgiannelos: api-gateway: expose restbase /api/ endpoint (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: Hnowlan) [10:31:38] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye [10:31:48] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1043.eqiad.wmnet with O...
[10:32:10] (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt1043: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/861350 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez) [10:32:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T321126)', diff saved to https://phabricator.wikimedia.org/P41277 and previous config saved to /var/cache/conftool/dbconfig/20221128-103213-marostegui.json [10:32:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:32:19] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [10:32:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:32:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41278 and previous config saved to /var/cache/conftool/dbconfig/20221128-103234-marostegui.json [10:32:42] (CR) Muehlenhoff: opentelemetry-collector: Basic install (1 comment) [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:33:55] (CR) Filippo Giunchedi: [C: +2] dcops: switch mgmt down alerts to open tasks [alerts] - https://gerrit.wikimedia.org/r/860525 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [10:33:59] (PS3) Filippo Giunchedi: dcops: switch mgmt down alerts to open tasks [alerts] - https://gerrit.wikimedia.org/r/860525 (https://phabricator.wikimedia.org/T310266) [10:34:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41279 and previous config saved to /var/cache/conftool/dbconfig/20221128-103444-marostegui.json [10:35:00] (PS2) JMeybohm: Rewrite as 
kubernetes operator/controller [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) [10:35:02] (PS2) JMeybohm: update vendor [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861353 (https://phabricator.wikimedia.org/T323706) [10:35:59] (PS15) Clément Goubert: opentelemetry-collector: Basic install [puppet] - https://gerrit.wikimedia.org/r/856931 [10:36:36] (CR) JMeybohm: "All the yaml in config/ is auto generated by the operator-sdk" [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm) [10:38:10] (PS16) Clément Goubert: opentelemetry-collector: Basic install [puppet] - https://gerrit.wikimedia.org/r/856931 [10:39:05] (PS21) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 [10:39:09] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38455/console" [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:39:46] (CR) Clément Goubert: [V: +1] opentelemetry-collector: Basic install (1 comment) [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:40:06] (PS3) David Caro: harbor: ensure that it's started [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) [10:41:18] (CR) CI reject: [V: -1] harbor: ensure that it's started [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: David Caro) [10:46:47] (CR) Muehlenhoff: [C: +1] "I don't have any insight on the content of the service YAML config, but the puppetisation part looks good" [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:48:33] !log aborrero@cumin1001 END 
(ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1043.eqiad.wmnet with OS bullseye [10:48:42] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bu... [10:48:59] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye [10:49:09] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1043.eqiad.wmnet with O... [10:49:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P41280 and previous config saved to /var/cache/conftool/dbconfig/20221128-104950-marostegui.json [10:51:58] (CR) Clément Goubert: [V: +1] opentelemetry-collector: Basic install (1 comment) [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [10:52:30] (PS4) David Caro: harbor: ensure that it's started [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) [10:53:18] (CR) David Caro: harbor: ensure that it's started (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: David Caro) [10:54:11] (PS17) Clément Goubert: opentelemetry-collector: Basic install [puppet] - https://gerrit.wikimedia.org/r/856931 [10:54:50] (CR) David Caro: [C: +2] p::metricsinfra:haproxy: rename some vars to reflect intent [puppet] - https://gerrit.wikimedia.org/r/831036 (owner: David Caro) [10:55:15] (CR) Clément Goubert: [V: +1] 
"PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38456/console" [puppet] - 10https://gerrit.wikimedia.org/r/856931 (owner: 10Clément Goubert) [10:56:59] (03CR) 10FNegri: harbor: ensure that it's started (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [10:58:08] (03PS4) 10David Caro: Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [10:58:43] (03CR) 10David Caro: Remove support for overriding LDAP client stack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [10:59:40] (03PS1) 10Filippo Giunchedi: wmnet: move read traffic to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861356 (https://phabricator.wikimedia.org/T318903) [10:59:42] (03PS1) 10Filippo Giunchedi: wmnet: move writes to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861357 (https://phabricator.wikimedia.org/T318903) [10:59:47] (03PS2) 10Filippo Giunchedi: hieradata: pool graphite1005 for reads [puppet] - 10https://gerrit.wikimedia.org/r/860522 (https://phabricator.wikimedia.org/T318903) [10:59:49] (03PS1) 10Filippo Giunchedi: graphite: move alerts to graphite1005 [puppet] - 10https://gerrit.wikimedia.org/r/861358 (https://phabricator.wikimedia.org/T318903) [10:59:51] (03PS1) 10Filippo Giunchedi: stats: failover writes to graphite1005 [puppet] - 10https://gerrit.wikimedia.org/r/861359 (https://phabricator.wikimedia.org/T318903) [11:02:18] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [11:02:28] (03CR) 10Hnowlan: [C: 03+2] api-gateway: expose restbase /api/ endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: 10Hnowlan) [11:03:07] PROBLEM - SSH on mw1320.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:03:17] (PS1) Filippo Giunchedi: ProductionServices: move to graphite1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) [11:03:48] (PS3) JMeybohm: Rewrite as kubernetes operator/controller [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) [11:03:50] (PS3) JMeybohm: update vendor [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861353 (https://phabricator.wikimedia.org/T323706) [11:04:03] (CR) CI reject: [V: -1] ProductionServices: move to graphite1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [11:04:43] (CR) David Caro: harbor: ensure that it's started (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: David Caro) [11:04:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P41281 and previous config saved to /var/cache/conftool/dbconfig/20221128-110456-marostegui.json [11:05:20] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [11:05:50] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [11:06:27] (PS22) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 [11:07:00] (CR) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs (8 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez) [11:07:42] (Merged) jenkins-bot: 
api-gateway: expose restbase /api/ endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: Hnowlan) [11:07:44] (CR) David Caro: harbor: ensure that it's started (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: David Caro) [11:10:47] (PS2) Filippo Giunchedi: ProductionServices: move to graphite1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) [11:12:12] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:13:29] (CR) Arturo Borrero Gonzalez: [C: +1] "Happy to see this code gone. I originally introduced it some time ago, and have tried to remove it a few times already. It was never the r" [puppet] - https://gerrit.wikimedia.org/r/826536 (owner: Majavah) [11:14:29] (CR) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez) [11:14:41] (CR) Phedenskog: [C: -1] "I don't have privileges to abandon this, but we should since we will not use WebPageTest + we wouldn't use the non open source version on " [puppet] - https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) (owner: Dave Pifke) [11:15:26] (CR) Clément Goubert: [V: +1 C: +2] opentelemetry-collector: Basic install [puppet] - https://gerrit.wikimedia.org/r/856931 (owner: Clément Goubert) [11:16:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2032.codfw.wmnet to cluster codfw and group B [11:20:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321126)', diff saved to 
https://phabricator.wikimedia.org/P41282 and previous config saved to /var/cache/conftool/dbconfig/20221128-112003-marostegui.json [11:20:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:20:11] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:20:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:20:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:20:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:20:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41283 and previous config saved to /var/cache/conftool/dbconfig/20221128-112053-marostegui.json [11:21:54] (PS1) Clément Goubert: opentelemetry::collector: Fix service ensure [puppet] - https://gerrit.wikimedia.org/r/861362 (https://phabricator.wikimedia.org/T320565) [11:22:41] (PS2) Clément Goubert: opentelemetry::collector: Fix service ensure [puppet] - https://gerrit.wikimedia.org/r/861362 (https://phabricator.wikimedia.org/T320565) [11:23:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41284 and previous config saved to /var/cache/conftool/dbconfig/20221128-112302-marostegui.json [11:23:41] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38457/console" [puppet] - https://gerrit.wikimedia.org/r/861362 (https://phabricator.wikimedia.org/T320565) (owner: Clément Goubert) [11:25:03] (CR) Clément Goubert: 
[V: +1 C: +2] opentelemetry::collector: Fix service ensure [puppet] - https://gerrit.wikimedia.org/r/861362 (https://phabricator.wikimedia.org/T320565) (owner: Clément Goubert) [11:26:57] (CR) Muehlenhoff: [C: +2] zookeeper: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860907 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [11:29:54] (PS2) Muehlenhoff: ceph: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860908 (https://phabricator.wikimedia.org/T308013) [11:29:56] (PS1) Clément Goubert: opentelemetry::collector: Fix config template [puppet] - https://gerrit.wikimedia.org/r/861364 (https://phabricator.wikimedia.org/T320565) [11:30:41] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bullseye [11:30:50] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bu... [11:30:59] (CR) FNegri: [C: +1] "I think if it works on toolsbeta-harbor-1 it's good enough for now, and we'll probably migrate this to k8s sooner or later."
[puppet] - https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: David Caro) [11:31:09] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38458/console" [puppet] - https://gerrit.wikimedia.org/r/861364 (https://phabricator.wikimedia.org/T320565) (owner: Clément Goubert) [11:31:36] (CR) Clément Goubert: [V: +1 C: +2] opentelemetry::collector: Fix config template [puppet] - https://gerrit.wikimedia.org/r/861364 (https://phabricator.wikimedia.org/T320565) (owner: Clément Goubert) [11:32:57] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:16] (CR) Muehlenhoff: [C: +2] ceph: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860908 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [11:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P41285 and previous config saved to /var/cache/conftool/dbconfig/20221128-113809-marostegui.json [11:51:01] (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/861351 (owner: Muehlenhoff) [11:51:11] (PS1) Hnowlan: thumbor: new release [deployment-charts] - https://gerrit.wikimedia.org/r/861367 (https://phabricator.wikimedia.org/T323775) [11:53:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P41286 and previous config saved to /var/cache/conftool/dbconfig/20221128-115316-marostegui.json [11:55:50] (PS1) Stevemunene: Add an-presto1006 to presto cluster [puppet] - https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) [11:57:27] PROBLEM - High average POST latency for mw requests on 
api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:59:23] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:07:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:07:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:07:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T323827)', diff saved to https://phabricator.wikimedia.org/P41287 and previous config saved to /var/cache/conftool/dbconfig/20221128-120727-ladsgroup.json [12:07:33] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [12:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41288 and previous config saved to /var/cache/conftool/dbconfig/20221128-120822-marostegui.json [12:08:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:08:28] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:08:37] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:08:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T321126)', diff saved to https://phabricator.wikimedia.org/P41289 and previous config saved to /var/cache/conftool/dbconfig/20221128-120843-marostegui.json [12:09:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] similar-users: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860706 (owner: 10Giuseppe Lavagetto) [12:10:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T321126)', diff saved to https://phabricator.wikimedia.org/P41290 and previous config saved to /var/cache/conftool/dbconfig/20221128-121052-marostegui.json [12:13:41] (03CR) 10Hnowlan: [C: 03+2] thumbor: new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/861367 (https://phabricator.wikimedia.org/T323775) (owner: 10Hnowlan) [12:14:07] (03Merged) 10jenkins-bot: similar-users: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860706 (owner: 10Giuseppe Lavagetto) [12:14:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] termbox: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860708 (owner: 10Giuseppe Lavagetto) [12:17:15] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:18:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:18:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:18:21] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/similar-users: apply [12:18:22] (03Merged) 10jenkins-bot: thumbor: new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/861367 (https://phabricator.wikimedia.org/T323775) 
(owner: 10Hnowlan) [12:18:29] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/similar-users: apply [12:18:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:18:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:19:47] (03Merged) 10jenkins-bot: termbox: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860708 (owner: 10Giuseppe Lavagetto) [12:20:44] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/similar-users: apply [12:21:19] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:22:11] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:22:28] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/similar-users: apply [12:26:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P41291 and previous config saved to /var/cache/conftool/dbconfig/20221128-122559-marostegui.json [12:30:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:31:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:31:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [12:32:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [12:32:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T323907)', diff saved to https://phabricator.wikimedia.org/P41292 and previous config saved to /var/cache/conftool/dbconfig/20221128-123206-ladsgroup.json [12:32:12] 
T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:32:31] (03PS1) 10Hnowlan: thumbor: Correct paths for 3d2png and tinyrgb [deployment-charts] - 10https://gerrit.wikimedia.org/r/861383 (https://phabricator.wikimedia.org/T323775) [12:32:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:32:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:32:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [12:32:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41293 and previous config saved to /var/cache/conftool/dbconfig/20221128-123251-ladsgroup.json [12:33:01] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [12:33:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [12:33:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db2109', diff saved to https://phabricator.wikimedia.org/P41294 and previous config saved to /var/cache/conftool/dbconfig/20221128-123312-ladsgroup.json [12:33:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T323827)', diff saved to https://phabricator.wikimedia.org/P41295 and previous config saved to /var/cache/conftool/dbconfig/20221128-123317-ladsgroup.json [12:33:21] (03CR) 10David Caro: harbor: ensure that it's started (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [12:35:46] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Data-Engineering, 10Event-Platform Value Stream, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide 
site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10lmata) 05Open→03Resolved a:03lmata Thank you @BTullis for T... [12:36:53] (03CR) 10David Caro: [C: 03+1] "The blocker is gone, thanks!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 (owner: 10Arturo Borrero Gonzalez) [12:37:11] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/similar-users: apply [12:37:50] (03CR) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 (owner: 10Arturo Borrero Gonzalez) [12:38:44] (03PS23) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [12:38:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323827)', diff saved to https://phabricator.wikimedia.org/P41296 and previous config saved to /var/cache/conftool/dbconfig/20221128-123845-ladsgroup.json [12:38:52] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [12:38:55] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:38:58] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/similar-users: apply [12:39:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:39:49] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:40:19] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:40:29] (03PS24) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/859114 [12:40:37] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:40:55] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [12:41:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P41297 and previous config saved to /var/cache/conftool/dbconfig/20221128-124105-marostegui.json [12:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323827)', diff saved to https://phabricator.wikimedia.org/P41298 and previous config saved to /var/cache/conftool/dbconfig/20221128-124125-ladsgroup.json [12:44:13] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [12:44:22] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply [12:45:21] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [12:45:34] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (10MoritzMuehlenhoff) There's also a fourth option that comes to my mind: Debian already ships various ICC profiles, in two separate packages: https://tracker.de... 
[12:46:35] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply [12:47:10] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [12:50:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41299 and previous config saved to /var/cache/conftool/dbconfig/20221128-125056-ladsgroup.json [12:51:03] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [12:51:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:51:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:51:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:51:44] (03PS1) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. [puppet] - 10https://gerrit.wikimedia.org/r/861385 [12:51:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:51:55] (03CR) 10Muehlenhoff: C:ldap::client::utils Rewrite add-ldap-group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [12:52:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T323907)', diff saved to https://phabricator.wikimedia.org/P41300 and previous config saved to /var/cache/conftool/dbconfig/20221128-125200-ladsgroup.json [12:52:07] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:53:52] (03CR) 10CI reject: [V: 04-1] ldap:management rewrite modify-mfa to use Bitu. 
[puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [12:53:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P41301 and previous config saved to /var/cache/conftool/dbconfig/20221128-125351-ladsgroup.json [12:54:39] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:55:26] (03PS6) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [12:56:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T321126)', diff saved to https://phabricator.wikimedia.org/P41302 and previous config saved to /var/cache/conftool/dbconfig/20221128-125612-marostegui.json [12:56:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:56:19] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:56:24] (03CR) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [12:56:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:56:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P41303 and previous config saved to /var/cache/conftool/dbconfig/20221128-125632-ladsgroup.json [12:59:35] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:00:36] (03PS25) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to 
maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [13:04:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323907)', diff saved to https://phabricator.wikimedia.org/P41304 and previous config saved to /var/cache/conftool/dbconfig/20221128-130443-ladsgroup.json [13:04:50] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [13:04:55] RECOVERY - SSH on mw1320.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:06:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P41305 and previous config saved to /var/cache/conftool/dbconfig/20221128-130603-ladsgroup.json [13:06:22] (03PS2) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. [puppet] - 10https://gerrit.wikimedia.org/r/861385 [13:06:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321126)', diff saved to https://phabricator.wikimedia.org/P41306 and previous config saved to /var/cache/conftool/dbconfig/20221128-130642-marostegui.json [13:06:49] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:08:36] (03CR) 10CI reject: [V: 04-1] ldap:management rewrite modify-mfa to use Bitu. [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [13:08:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P41307 and previous config saved to /var/cache/conftool/dbconfig/20221128-130858-ladsgroup.json [13:09:43] (03PS3) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. 
[puppet] - 10https://gerrit.wikimedia.org/r/861385 [13:10:03] (03CR) 10Muehlenhoff: C:ldap::client::utils Rewrite add-ldap-group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [13:11:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P41308 and previous config saved to /var/cache/conftool/dbconfig/20221128-131138-ladsgroup.json [13:14:05] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:14:29] (03PS7) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [13:16:27] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [13:16:34] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... 
[13:18:41] !log upgrade thanos on thanos-fe2001 - T303154 [13:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:47] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [13:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P41309 and previous config saved to /var/cache/conftool/dbconfig/20221128-131949-ladsgroup.json [13:20:12] !log rebalance Ganeti group B/codfw following reboots [13:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P41310 and previous config saved to /var/cache/conftool/dbconfig/20221128-132109-ladsgroup.json [13:21:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10MoritzMuehlenhoff) [13:21:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P41311 and previous config saved to /var/cache/conftool/dbconfig/20221128-132149-marostegui.json [13:21:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: ganeti203[12] implementation tracking - https://phabricator.wikimedia.org/T313857 (10MoritzMuehlenhoff) 05Open→03Resolved ganeti2031 and ganeti2032 have been added to the codfw Ganeti cluster. 
[13:21:55] !log upgrade thanos on thanos-fe2* - T303154 [13:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323827)', diff saved to https://phabricator.wikimedia.org/P41312 and previous config saved to /var/cache/conftool/dbconfig/20221128-132404-ladsgroup.json [13:24:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:24:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:24:13] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [13:24:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T323827)', diff saved to https://phabricator.wikimedia.org/P41313 and previous config saved to /var/cache/conftool/dbconfig/20221128-132415-ladsgroup.json [13:24:52] !log upgrade thanos on prometheus2* - T303154 [13:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:58] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [13:26:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323827)', diff saved to https://phabricator.wikimedia.org/P41314 and previous config saved to /var/cache/conftool/dbconfig/20221128-132645-ladsgroup.json [13:26:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2125.codfw.wmnet with reason: Maintenance [13:27:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2125.codfw.wmnet with reason: Maintenance [13:27:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T323827)', diff saved to https://phabricator.wikimedia.org/P41315 and previous config saved to 
/var/cache/conftool/dbconfig/20221128-132706-ladsgroup.json [13:27:18] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [13:27:23] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [13:27:44] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [13:27:51] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [13:31:07] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [13:32:04] !log filippo@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-query,name=eqiad [13:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P41316 and previous config saved to /var/cache/conftool/dbconfig/20221128-133456-ladsgroup.json [13:36:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41317 and previous config saved to /var/cache/conftool/dbconfig/20221128-133615-ladsgroup.json [13:36:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [13:36:22] T323827: Finish timestamp schema changes in flaggedrevs - 
https://phabricator.wikimedia.org/T323827 [13:36:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [13:36:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T323827)', diff saved to https://phabricator.wikimedia.org/P41318 and previous config saved to /var/cache/conftool/dbconfig/20221128-133648-ladsgroup.json [13:36:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/860909 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:36:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P41319 and previous config saved to /var/cache/conftool/dbconfig/20221128-133655-marostegui.json [13:38:30] (03CR) 10Btullis: "Can you do a PCC run please, before we merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [13:40:43] 10SRE, 10SRE-OnFire, 10Product-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (10lmata) @Marostegui: Thank you for following up, I missed your earlier ping. Reading T26... 
[13:41:22] (03CR) 10Klausman: Add basic rate-limit capabilities to ML clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/860925 (https://phabricator.wikimedia.org/T300259) (owner: 10Elukey) [13:42:39] (03CR) 10Klausman: [C: 03+1] knative: import new upstream version 1.7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [13:45:56] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [13:46:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323827)', diff saved to https://phabricator.wikimedia.org/P41320 and previous config saved to /var/cache/conftool/dbconfig/20221128-134635-ladsgroup.json [13:46:41] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [13:47:38] !log restart grafana-server on grafana1002 [13:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:10] sorry for the brief disruption ^ [13:49:22] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [13:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T323907)', diff saved to https://phabricator.wikimedia.org/P41321 and previous config saved to /var/cache/conftool/dbconfig/20221128-135002-ladsgroup.json [13:50:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [13:50:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [13:50:08] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [13:51:34] !log rebalance Ganeti group C/eqiad T311687 [13:51:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:40] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [13:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321126)', diff saved to https://phabricator.wikimedia.org/P41322 and previous config saved to /var/cache/conftool/dbconfig/20221128-135202-marostegui.json [13:52:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:52:08] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:52:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:52:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T321126)', diff saved to https://phabricator.wikimedia.org/P41323 and previous config saved to /var/cache/conftool/dbconfig/20221128-135223-marostegui.json [13:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323827)', diff saved to https://phabricator.wikimedia.org/P41324 and previous config saved to /var/cache/conftool/dbconfig/20221128-135349-ladsgroup.json [13:53:56] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [13:54:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321126)', diff saved to https://phabricator.wikimedia.org/P41325 and previous config saved to /var/cache/conftool/dbconfig/20221128-135433-marostegui.json [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T1400). [14:00:04] cirno: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:01:15] I’m still having lunch, if nobody else is around I can deploy later in the window (cirno feel free to ping me in, idk, 30 minutes?) [14:01:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P41326 and previous config saved to /var/cache/conftool/dbconfig/20221128-140141-ladsgroup.json [14:04:13] (03CR) 10Jbond: "LGTM comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [14:06:16] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2050.codfw.wmnet with OS bullseye [14:06:23] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[14:07:34] (03PS2) 10Jaime Nuche: create group for Release Engineering members [puppet] - 10https://gerrit.wikimedia.org/r/860836 [14:07:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:08:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P41327 and previous config saved to /var/cache/conftool/dbconfig/20221128-140855-ladsgroup.json [14:09:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P41328 and previous config saved to /var/cache/conftool/dbconfig/20221128-140939-marostegui.json [14:09:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:10:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:10:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T323907)', diff saved to https://phabricator.wikimedia.org/P41329 and previous config saved to /var/cache/conftool/dbconfig/20221128-141016-ladsgroup.json [14:10:23] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [14:10:29] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for abartov - https://phabricator.wikimedia.org/T323911 (10Marostegui) p:05Triage→03Medium @Asaf from what I can see you are already part of the wmf LDAP group. Not sure if you need something else apart - @Ottomata is there anything else required to access... [14:10:51] o/ [14:11:04] Lucas_WMDE: sorry I missed the ping [14:12:05] (03CR) 10Muehlenhoff: ldap:management rewrite modify-mfa to use Bitu. 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [14:12:13] (03PS2) 10Matthias Mullie: Add mediawiki.searchpreview schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) [14:12:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:13:05] (03PS3) 10Matthias Mullie: Add mediawiki.searchpreview schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) [14:15:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) @BTullis we just received batteries. When would work best for you I would like to do them this week if... [14:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P41330 and previous config saved to /var/cache/conftool/dbconfig/20221128-141648-ladsgroup.json [14:19:09] (03CR) 10Elukey: Add basic rate-limit capabilities to ML clusters (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/860925 (https://phabricator.wikimedia.org/T300259) (owner: 10Elukey) [14:19:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10BTullis) >>! In T318659#8424520, @Jclark-ctr wrote: > @BTullis we just received batteries. When would work best for... 
[14:20:10] SRE, LDAP-Access-Requests: Grant Access to wmf for abartov - https://phabricator.wikimedia.org/T323911 (MoritzMuehlenhoff) If Asaf needs Supetset access to private tables he needs to be added to the analytics-privatedata-users group, but without an SSH key, see https://wikitech.wikimedia.org/wiki/Analyti... [14:20:48] SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (Jclark-ctr) Yea I am on site right now Let me know when they are ready for me [14:21:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T323907)', diff saved to https://phabricator.wikimedia.org/P41331 and previous config saved to /var/cache/conftool/dbconfig/20221128-142107-ladsgroup.json [14:21:14] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [14:22:11] * Lucas_WMDE back [14:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P41332 and previous config saved to /var/cache/conftool/dbconfig/20221128-142402-ladsgroup.json [14:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P41333 and previous config saved to /var/cache/conftool/dbconfig/20221128-142446-marostegui.json [14:25:38] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/860974 (https://phabricator.wikimedia.org/T323734) (owner: Stang) [14:25:45] cirno: ^ [14:25:49] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [14:25:58] SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with
destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [14:26:00] SRE, LDAP-Access-Requests: Grant Access to wmf for abartov - https://phabricator.wikimedia.org/T323911 (Marostegui) @Ottomata and @odimitrijevic could you approve the access to `analytics-privatedata-users` @Asaf could you get your manager (Simona, per namely) to approve this too?. I don't see them on ph... [14:26:23] (Merged) jenkins-bot: wikidatawiki: Add ne language logo variant [mediawiki-config] - https://gerrit.wikimedia.org/r/860974 (https://phabricator.wikimedia.org/T323734) (owner: Stang) [14:26:27] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [14:26:36] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:860974|wikidatawiki: Add ne language logo variant (T323734)]] [14:26:42] T323734: Move language-specific logos from Commons.css to logos.php at wikidatawiki - https://phabricator.wikimedia.org/T323734 [14:27:36] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:860974|wikidatawiki: Add ne language logo variant (T323734)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:27:43] cirno: please test [14:28:03] https://www.wikidata.org/?uselang=ne on mwdebug looks good to me (after a force-reload) [14:28:09] Lucas_WMDE: tested with ?uselang=ne and it looks good to me [14:28:14] yay, thanks [14:28:20] syncing [14:28:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:29:11] !log mwdebug-deploy@deploy1002
helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:29:54] trwikimedia change also looks good to me on Gerrit, I’ll deploy that afterwards [14:30:59] (CR) Herron: [C: +1] hieradata: pool graphite1005 for reads [puppet] - https://gerrit.wikimedia.org/r/860522 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [14:31:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323827)', diff saved to https://phabricator.wikimedia.org/P41334 and previous config saved to /var/cache/conftool/dbconfig/20221128-143154-ladsgroup.json [14:31:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2126.codfw.wmnet with reason: Maintenance [14:32:01] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [14:32:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2126.codfw.wmnet with reason: Maintenance [14:32:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on db2095.codfw.wmnet with reason: Maintenance [14:32:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db2095.codfw.wmnet with reason: Maintenance [14:32:25] (PS2) Lucas Werkmeister (WMDE): trwikimedia: Update logo [mediawiki-config] - https://gerrit.wikimedia.org/r/860975 (https://phabricator.wikimedia.org/T323850) (owner: Stang) [14:32:28] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:860974|wikidatawiki: Add ne language logo variant (T323734)]] (duration: 05m 52s) [14:32:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T323827)', diff saved to https://phabricator.wikimedia.org/P41335 and previous config saved to /var/cache/conftool/dbconfig/20221128-143231-ladsgroup.json [14:32:35] T323734: Move language-specific logos from Commons.css to logos.php at wikidatawiki -
https://phabricator.wikimedia.org/T323734 [14:33:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:33:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:33:32] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/860975 (https://phabricator.wikimedia.org/T323850) (owner: Stang) [14:33:47] now we’ll find out if `scap backport` automatically purges the PNGs or if I still need to do that manually [14:33:48] (CR) Ssingh: [V: +1 C: +2] hiera: unify ulsfo LVS configuration [puppet] - https://gerrit.wikimedia.org/r/860930 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh) [14:33:53] (I suspect the latter, but I’m ready to be surprised ;) ) [14:34:16] (Merged) jenkins-bot: trwikimedia: Update logo [mediawiki-config] - https://gerrit.wikimedia.org/r/860975 (https://phabricator.wikimedia.org/T323850) (owner: Stang) [14:34:28] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:860975|trwikimedia: Update logo (T323850)]] [14:34:34] T323850: Change the logo of Wikimedia Turkey on tr.wikimedia.org - https://phabricator.wikimedia.org/T323850 [14:35:22] !log rebalance Ganeti group D/eqiad T311687 [14:35:25] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:860975|trwikimedia: Update logo (T323850)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:27] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [14:35:38] cirno: please test [14:35:48] (looks good on my end, I think) [14:35:52] Lucas_WMDE: looks good to me [14:36:03] syncing [14:36:14] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41336 and previous config saved to /var/cache/conftool/dbconfig/20221128-143613-ladsgroup.json [14:36:24] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (Jclark-ctr) db1206 B1 U36 Port 26 Cableid 3285 [14:36:36] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (Jclark-ctr) [14:36:51] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [14:36:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:37:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [14:39:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323827)', diff saved to https://phabricator.wikimedia.org/P41337 and previous config saved to /var/cache/conftool/dbconfig/20221128-143908-ladsgroup.json [14:39:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:39:16] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [14:39:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:39:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321126)', diff saved to https://phabricator.wikimedia.org/P41338 and previous config saved to /var/cache/conftool/dbconfig/20221128-143952-marostegui.json [14:39:53] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:860975|trwikimedia: Update logo (T323850)]] (duration: 05m 24s) [14:39:54] !log
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:39:58] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:40:04] T323850: Change the logo of Wikimedia Turkey on tr.wikimedia.org - https://phabricator.wikimedia.org/T323850 [14:40:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:40:10] looks like it needs to be purged manually, one sec [14:40:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1160.eqiad.wmnet with reason: Maintenance [14:40:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1160.eqiad.wmnet with reason: Maintenance [14:40:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T321126)', diff saved to https://phabricator.wikimedia.org/P41339 and previous config saved to /var/cache/conftool/dbconfig/20221128-144029-marostegui.json [14:40:42] (CR) FNegri: [C: +1] harbor: remove support for !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323827)', diff saved to https://phabricator.wikimedia.org/P41340 and previous config saved to /var/cache/conftool/dbconfig/20221128-144050-ladsgroup.json [14:41:11] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/trwikimedia%s.png\n' '' '-1.5x' '-2x' | mwscript purgeList.php # T323850 [14:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:35] anything else to deploy?
[14:41:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:42:21] that's all from me [14:42:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321126)', diff saved to https://phabricator.wikimedia.org/P41341 and previous config saved to /var/cache/conftool/dbconfig/20221128-144239-marostegui.json [14:42:53] !log UTC afternoon backport+config window done [14:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:37] (CR) David Caro: [C: +1] "👍" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez) [14:44:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T323827)', diff saved to https://phabricator.wikimedia.org/P41342 and previous config saved to /var/cache/conftool/dbconfig/20221128-144435-ladsgroup.json [14:44:42] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [14:44:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:44:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:45:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:48:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41343 and previous config saved to /var/cache/conftool/dbconfig/20221128-145120-ladsgroup.json [14:52:52] (PS1) Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395
(https://phabricator.wikimedia.org/T323793) [14:53:38] (CR) CI reject: [V: -1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: Elukey) [14:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P41344 and previous config saved to /var/cache/conftool/dbconfig/20221128-145556-ladsgroup.json [14:57:12] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [14:57:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P41345 and previous config saved to /var/cache/conftool/dbconfig/20221128-145745-marostegui.json [14:57:54] (CR) Muehlenhoff: create group for Release Engineering members (1 comment) [puppet] - https://gerrit.wikimedia.org/r/860836 (owner: Jaime Nuche) [14:58:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:59:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P41346 and previous config saved to /var/cache/conftool/dbconfig/20221128-145942-ladsgroup.json [15:00:46] (CR) Hnowlan: [C: +2] thumbor: Correct paths for 3d2png and tinyrgb [deployment-charts] - https://gerrit.wikimedia.org/r/861383 (https://phabricator.wikimedia.org/T323775) (owner: Hnowlan) [15:06:05] (Merged) jenkins-bot: thumbor: Correct paths for 3d2png and tinyrgb [deployment-charts] - https://gerrit.wikimedia.org/r/861383 (https://phabricator.wikimedia.org/T323775) (owner: Hnowlan)
[15:06:16] (ThanosSidecarBucketOperationsFailed) firing: (3) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [15:06:20] SRE, Epic: Encrypt all the things - https://phabricator.wikimedia.org/T111653 (LSobanski) Open→Resolved a:LSobanski The remaining two open action items are assigned to specific teams and the value of this task is limited so I'm resolving it. [15:06:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:06:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T323907)', diff saved to https://phabricator.wikimedia.org/P41347 and previous config saved to /var/cache/conftool/dbconfig/20221128-150626-ladsgroup.json [15:06:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:06:35] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:06:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:06:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T323827)', diff saved to https://phabricator.wikimedia.org/P41348 and previous config saved to /var/cache/conftool/dbconfig/20221128-150643-ladsgroup.json [15:06:54] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [15:06:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T323907)', diff saved to https://phabricator.wikimedia.org/P41349 and previous config
saved to /var/cache/conftool/dbconfig/20221128-150654-ladsgroup.json [15:07:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [15:09:49] SRE, Infrastructure-Foundations: Ferm should log errors when failing to create all configured rules - https://phabricator.wikimedia.org/T237020 (LSobanski) [15:10:45] SRE, SRE Observability: Important nagios-nrpe-server errors not showing up in unit journal - https://phabricator.wikimedia.org/T237236 (LSobanski) [15:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P41350 and previous config saved to /var/cache/conftool/dbconfig/20221128-151103-ladsgroup.json [15:11:16] (ThanosSidecarBucketOperationsFailed) firing: (10) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [15:12:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P41351 and previous config saved to /var/cache/conftool/dbconfig/20221128-151252-marostegui.json [15:12:58] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:06] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:24] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2050.codfw.wmnet with OS bullseye [15:13:29] SRE-swift-storage, Infrastructure-Foundations,
Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [15:14:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P41352 and previous config saved to /var/cache/conftool/dbconfig/20221128-151448-ladsgroup.json [15:15:44] (PS1) Filippo Giunchedi: Add thanos-web.svc and discovery [dns] - https://gerrit.wikimedia.org/r/861396 (https://phabricator.wikimedia.org/T323913) [15:16:15] looking into the thanos alert [15:17:40] SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (Joe) The most obvious thing to me is to include the file in the thumbor docker image. It's ok to have a small binary that doesn't change much in it. [15:18:32] (PS3) Giuseppe Lavagetto: thumbor: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860711 [15:19:24] (PS1) Dbrant: Enable shared Reading Lists landing page on all wikis.
[mediawiki-config] - https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) [15:23:05] (CR) Arturo Borrero Gonzalez: [C: +2] cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114 (owner: Arturo Borrero Gonzalez) [15:23:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:25:33] SRE, Cloud-Services, wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987 (LSobanski) silver.wikimedia.org seems to be long gone and the arguments in the task so far don't make me feel strongly about setting u... [15:25:48] SRE, Cloud-Services, wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987 (LSobanski) Open→Resolved a:LSobanski [15:26:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323827)', diff saved to https://phabricator.wikimedia.org/P41353 and previous config saved to /var/cache/conftool/dbconfig/20221128-152609-ladsgroup.json [15:26:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2138.codfw.wmnet with reason: Maintenance [15:26:16] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [15:26:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2138.codfw.wmnet with reason: Maintenance [15:26:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41354 and previous config saved to
/var/cache/conftool/dbconfig/20221128-152631-ladsgroup.json [15:27:27] SRE, Traffic, affects-Kiwix-and-openZIM: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011 (Kelson) The reported bug seems indeed to have "vanished". Thank you for the good work. [15:27:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321126)', diff saved to https://phabricator.wikimedia.org/P41355 and previous config saved to /var/cache/conftool/dbconfig/20221128-152758-marostegui.json [15:28:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:28:05] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:28:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T321126)', diff saved to https://phabricator.wikimedia.org/P41356 and previous config saved to /var/cache/conftool/dbconfig/20221128-152820-marostegui.json [15:28:44] (CR) Giuseppe Lavagetto: [C: +2] thumbor: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860711 (owner: Giuseppe Lavagetto) [15:29:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T323827)', diff saved to https://phabricator.wikimedia.org/P41357 and previous config saved to /var/cache/conftool/dbconfig/20221128-152955-ladsgroup.json [15:29:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [15:30:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [15:30:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling
db1129 (T323827)', diff saved to https://phabricator.wikimedia.org/P41358 and previous config saved to /var/cache/conftool/dbconfig/20221128-153016-ladsgroup.json [15:30:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321126)', diff saved to https://phabricator.wikimedia.org/P41359 and previous config saved to /var/cache/conftool/dbconfig/20221128-153029-marostegui.json [15:32:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:33:45] !log revert back to thanos 0.21 - T303154 [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:51] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [15:34:07] (Merged) jenkins-bot: thumbor: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860711 (owner: Giuseppe Lavagetto) [15:34:34] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-query,name=eqiad [15:34:57] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (Ottomata) Hi, this sounds like an issue with your ssh config and your ssh key. If your key is configured correctly, ssh should not prompt you for a passw... [15:35:51] (CR) MSantos: [C: +1] maps: remove Cassandra and Tilerator service [puppet] - https://gerrit.wikimedia.org/r/860634 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan) [15:35:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (Ottomata) > Wenjun's access is ssh-less access to analytics-privatedata-users group, right?
If so, to remove their public key from the task description Correct. [15:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323827)', diff saved to https://phabricator.wikimedia.org/P41360 and previous config saved to /var/cache/conftool/dbconfig/20221128-153628-ladsgroup.json [15:36:35] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [15:37:21] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:37:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:38:10] (PS1) Elukey: knative-serving: improve chart's dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) [15:38:23] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:38:56] (PS2) Elukey: knative-serving: improve chart's dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) [15:38:58] (CR) CI reject: [V: -1] knative-serving: improve chart's dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) (owner: Elukey) [15:39:04] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323827)', diff saved to https://phabricator.wikimedia.org/P41361 and previous config saved to /var/cache/conftool/dbconfig/20221128-153916-ladsgroup.json [15:39:32] (PS1) Klausman: (WIP) API GW: add config for addtional LW inference services [deployment-charts] - https://gerrit.wikimedia.org/r/861401
(https://phabricator.wikimedia.org/T323916) [15:41:01] (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager: allow for less frequent backups [puppet] - https://gerrit.wikimedia.org/r/858659 (https://phabricator.wikimedia.org/T306200) (owner: Andrew Bogott) [15:41:13] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:41:25] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:42:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T323907)', diff saved to https://phabricator.wikimedia.org/P41362 and previous config saved to /var/cache/conftool/dbconfig/20221128-154234-ladsgroup.json [15:42:41] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:43:49] (ThanosSidecarBucketOperationsFailed) resolved: (10) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [15:44:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41363 and previous config saved to /var/cache/conftool/dbconfig/20221128-154404-ladsgroup.json [15:44:11] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [15:44:55] (CR) Jbond: "lgtm some minor nits/comments inline" [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede) [15:45:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P41364 and previous config saved to /var/cache/conftool/dbconfig/20221128-154536-marostegui.json [15:46:34] SRE, LDAP-Access-Requests: Grant Access to wmf for abartov - https://phabricator.wikimedia.org/T323911 (Ottomata)
Approve! [15:46:39] (PS1) Muehlenhoff: Update partman config for maps [puppet] - https://gerrit.wikimedia.org/r/861405 [15:50:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:51:02] ops-codfw, serviceops: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (Papaul) [15:51:14] ops-codfw, serviceops: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (Papaul) p:Triage→High [15:51:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:51:34] SRE, ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (Papaul) p:Triage→Medium [15:51:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41365 and previous config saved to /var/cache/conftool/dbconfig/20221128-155135-ladsgroup.json [15:52:46] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:52:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48976 bytes in 9.029 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:52:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:34] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10Papaul) @MoritzMuehlenhoff unfortunately this server is out of warranty. [15:53:54] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [15:54:01] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... 
[15:54:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P41366 and previous config saved to /var/cache/conftool/dbconfig/20221128-155423-ladsgroup.json [15:56:59] (03PS1) 10Muehlenhoff: Set role_contacts for failoid to SRE IF [puppet] - 10https://gerrit.wikimedia.org/r/861409 [15:57:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P41367 and previous config saved to /var/cache/conftool/dbconfig/20221128-155740-ladsgroup.json [15:58:26] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/861409 (owner: 10Muehlenhoff) [15:59:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P41368 and previous config saved to /var/cache/conftool/dbconfig/20221128-155910-ladsgroup.json [15:59:12] (03PS1) 10Filippo Giunchedi: conftool: add thanos-web service [puppet] - 10https://gerrit.wikimedia.org/r/861411 (https://phabricator.wikimedia.org/T323913) [15:59:18] (03PS1) 10Filippo Giunchedi: thanos: add thanos-web to catalog and frontend [puppet] - 10https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) [16:00:40] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [16:00:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P41369 and previous config saved to /var/cache/conftool/dbconfig/20221128-160042-marostegui.json [16:00:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Htriedman) @andrea.denisse that is correct! 
2023-06-30 is the expiry date for @dasm [16:00:47] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... [16:01:08] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [16:01:19] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [16:01:50] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:02:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:03:57] (03PS3) 10Elukey: knative-serving: improve chart's dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) [16:04:48] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/860591 [16:06:25] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:06:35] 10SRE, 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10Papaul) I tested the HW on the server all looking good. The only error i had was error-code 2000-0251 which is not a big issue see link below for more information on error-code. I think the task can be closed. Thanks.... 
[16:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41370 and previous config saved to /var/cache/conftool/dbconfig/20221128-160641-ladsgroup.json [16:08:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [16:08:47] 10SRE, 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10Marostegui) Thank you Papaul, I will get this host back to the load balancer and then close the task. [16:09:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P41371 and previous config saved to /var/cache/conftool/dbconfig/20221128-160929-ladsgroup.json [16:12:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P41372 and previous config saved to /var/cache/conftool/dbconfig/20221128-161247-ladsgroup.json [16:12:52] (03PS2) 10Muehlenhoff: Set role_contacts for failoid to SRE IF [puppet] - 10https://gerrit.wikimedia.org/r/861409 [16:14:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P41373 and previous config saved to /var/cache/conftool/dbconfig/20221128-161417-ladsgroup.json [16:15:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321126)', diff saved to https://phabricator.wikimedia.org/P41374 and previous config saved to /var/cache/conftool/dbconfig/20221128-161549-marostegui.json [16:15:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1199.eqiad.wmnet with reason: Maintenance [16:16:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
5:00:00 on db1199.eqiad.wmnet with reason: Maintenance [16:16:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T321126)', diff saved to https://phabricator.wikimedia.org/P41375 and previous config saved to /var/cache/conftool/dbconfig/20221128-161610-marostegui.json [16:16:56] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:18:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321126)', diff saved to https://phabricator.wikimedia.org/P41376 and previous config saved to /var/cache/conftool/dbconfig/20221128-161820-marostegui.json [16:19:02] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [16:21:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:21:40] (03PS1) 10Jbond: swift_disks: update for new partioning schema [puppet] - 10https://gerrit.wikimedia.org/r/861424 [16:21:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323827)', diff saved to https://phabricator.wikimedia.org/P41377 and previous config saved to /var/cache/conftool/dbconfig/20221128-162148-ladsgroup.json [16:21:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:21:55] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:22:07] (03CR) 10Jbond: [C: 03+2] swift_disks: update for new partioning schema [puppet] - 10https://gerrit.wikimedia.org/r/861424 (owner: 10Jbond) [16:22:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:22:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime 
for 20:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:22:32] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [16:22:33] (03CR) 10Jbond: [V: 03+2 C: 03+2] swift_disks: update for new partioning schema [puppet] - 10https://gerrit.wikimedia.org/r/861424 (owner: 10Jbond) [16:22:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:22:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41378 and previous config saved to /var/cache/conftool/dbconfig/20221128-162246-ladsgroup.json [16:24:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323827)', diff saved to https://phabricator.wikimedia.org/P41379 and previous config saved to /var/cache/conftool/dbconfig/20221128-162436-ladsgroup.json [16:24:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:24:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:25:13] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2050.codfw.wmnet with OS bullseye [16:25:20] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[16:26:17] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) a:03LSobanski [16:27:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T323907)', diff saved to https://phabricator.wikimedia.org/P41380 and previous config saved to /var/cache/conftool/dbconfig/20221128-162753-ladsgroup.json [16:27:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:28:03] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [16:28:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T323907)', diff saved to https://phabricator.wikimedia.org/P41381 and previous config saved to /var/cache/conftool/dbconfig/20221128-162815-ladsgroup.json [16:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41382 and previous config saved to /var/cache/conftool/dbconfig/20221128-162923-ladsgroup.json [16:29:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2148.codfw.wmnet with reason: Maintenance [16:29:30] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:29:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2148.codfw.wmnet with reason: Maintenance [16:29:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T323827)', diff saved to https://phabricator.wikimedia.org/P41383 and previous config saved to /var/cache/conftool/dbconfig/20221128-162945-ladsgroup.json [16:29:50] (03PS1) 
10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861426 (https://phabricator.wikimedia.org/T128546) [16:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T1630). [16:32:37] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861426 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:33:19] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861426 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:33:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P41384 and previous config saved to /var/cache/conftool/dbconfig/20221128-163328-marostegui.json [16:34:20] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [16:34:27] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... 
[16:37:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:37:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:38:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:38:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41385 and previous config saved to /var/cache/conftool/dbconfig/20221128-163850-ladsgroup.json [16:39:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41386 and previous config saved to /var/cache/conftool/dbconfig/20221128-163859-ladsgroup.json [16:39:32] (03CR) 10Muehlenhoff: [C: 03+2] Set role_contacts for failoid to SRE IF [puppet] - 10https://gerrit.wikimedia.org/r/861409 (owner: 10Muehlenhoff) [16:39:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:39:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:39:46] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:39:52] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:856611| Bumping portals to master (T128546)]] (duration: 04m 33s) [16:40:17] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:43:29] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] DONE helmfile.d/services/mw-debug: apply [16:44:21] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:856611| Bumping portals to master (T128546)]] (duration: 04m 28s) [16:46:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T323827)', diff saved to https://phabricator.wikimedia.org/P41387 and previous config saved to /var/cache/conftool/dbconfig/20221128-164646-ladsgroup.json [16:46:58] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:47:09] (03PS2) 10Jbond: install_server: migrate ms-bs_simple top GPT [puppet] - 10https://gerrit.wikimedia.org/r/860581 (https://phabricator.wikimedia.org/T308677) [16:47:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:48:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P41388 and previous config saved to /var/cache/conftool/dbconfig/20221128-164834-marostegui.json [16:48:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:52:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:52:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:53:51] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2050.codfw.wmnet with OS bullseye [16:53:57] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jbond@cum... [16:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41389 and previous config saved to /var/cache/conftool/dbconfig/20221128-165406-ladsgroup.json [16:54:36] (03PS3) 10Jbond: install_server: migrate ms-bs_simple top GPT [puppet] - 10https://gerrit.wikimedia.org/r/860581 (https://phabricator.wikimedia.org/T308677) [16:55:09] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [16:55:17] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [16:56:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:56:54] (03CR) 10Andrew Bogott: [C: 03+2] neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [16:56:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41390 and previous config saved to /var/cache/conftool/dbconfig/20221128-165654-ladsgroup.json [16:57:01] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [16:57:26] (03CR) 10Andrew Bogott: [C: 03+2] Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [16:57:39] (03PS8) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) 
[16:58:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:59:17] (03PS2) 10Andrew Bogott: glance: use memcached for token caching [puppet] - 10https://gerrit.wikimedia.org/r/858651 (https://phabricator.wikimedia.org/T323319) [17:01:01] (03PS1) 10Marostegui: control-mariadb-client-10.4-bullseye: Back to 10.4.26 [software] - 10https://gerrit.wikimedia.org/r/861428 (https://phabricator.wikimedia.org/T323928) [17:01:36] (03CR) 10Marostegui: [C: 03+2] control-mariadb-client-10.4-bullseye: Back to 10.4.26 [software] - 10https://gerrit.wikimedia.org/r/861428 (https://phabricator.wikimedia.org/T323928) (owner: 10Marostegui) [17:01:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P41391 and previous config saved to /var/cache/conftool/dbconfig/20221128-170153-ladsgroup.json [17:02:06] (03Merged) 10jenkins-bot: control-mariadb-client-10.4-bullseye: Back to 10.4.26 [software] - 10https://gerrit.wikimedia.org/r/861428 (https://phabricator.wikimedia.org/T323928) (owner: 10Marostegui) [17:02:51] (03CR) 10Andrew Bogott: [C: 03+2] glance: use memcached for token caching [puppet] - 10https://gerrit.wikimedia.org/r/858651 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [17:03:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321126)', diff saved to https://phabricator.wikimedia.org/P41392 and previous config saved to /var/cache/conftool/dbconfig/20221128-170340-marostegui.json [17:03:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:03:49] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:03:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:03:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2099.codfw.wmnet with reason: Maintenance [17:04:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2099.codfw.wmnet with reason: Maintenance [17:04:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2106.codfw.wmnet with reason: Maintenance [17:04:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2106.codfw.wmnet with reason: Maintenance [17:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T321126)', diff saved to https://phabricator.wikimedia.org/P41393 and previous config saved to /var/cache/conftool/dbconfig/20221128-170438-marostegui.json [17:05:04] (03CR) 10Andrew Bogott: [C: 03+2] Patch cinder volume_type api to allow non-uuid project ids. [puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) (owner: 10Andrew Bogott) [17:06:00] (03CR) 10Andrew Bogott: [C: 03+2] trove: remove network_label_regex [puppet] - 10https://gerrit.wikimedia.org/r/858655 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [17:06:12] (03CR) 10Andrew Bogott: [C: 03+2] cinder.conf: lock_path to oslo_concurrency [puppet] - 10https://gerrit.wikimedia.org/r/858653 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [17:06:38] (03CR) 10Andrew Bogott: [C: 03+2] cinder: remove default quota settings [puppet] - 10https://gerrit.wikimedia.org/r/858654 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [17:06:51] (03PS2) 10Andrew Bogott: cinder.conf: lock_path to oslo_concurrency [puppet] - 10https://gerrit.wikimedia.org/r/858653 (https://phabricator.wikimedia.org/T323319) [17:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321126)', 
diff saved to https://phabricator.wikimedia.org/P41394 and previous config saved to /var/cache/conftool/dbconfig/20221128-170651-marostegui.json [17:09:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41395 and previous config saved to /var/cache/conftool/dbconfig/20221128-170912-ladsgroup.json [17:09:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:12:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P41396 and previous config saved to /var/cache/conftool/dbconfig/20221128-171200-ladsgroup.json [17:13:38] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [17:13:49] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on mc-wf2001.codfw.wmnet with reason: Kernel upgrade [17:14:03] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on mc-wf2001.codfw.wmnet with reason: Kernel upgrade [17:14:10] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on mc-wf2002.codfw.wmnet with reason: Kernel upgrade [17:14:23] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on mc-wf2002.codfw.wmnet with reason: Kernel upgrade [17:15:03] (03PS2) 10Elukey: Add basic rate-limit capabilities to ML clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/860925 (https://phabricator.wikimedia.org/T300259) [17:15:45] (03PS1) 10Jbond: wmflib: update xfs partitions to 4/5 after conversion to GPT [puppet] - 10https://gerrit.wikimedia.org/r/861429 [17:15:47] (03CR) 10Elukey: Add basic rate-limit capabilities to ML clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/860925 
(https://phabricator.wikimedia.org/T300259) (owner: 10Elukey) [17:16:58] (03CR) 10Jbond: [C: 03+2] install_server: migrate ms-bs_simple top GPT [puppet] - 10https://gerrit.wikimedia.org/r/860581 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P41397 and previous config saved to /var/cache/conftool/dbconfig/20221128-171659-ladsgroup.json [17:17:07] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [17:17:56] (03CR) 10Jbond: [C: 03+2] wmflib: update xfs partitions to 4/5 after conversion to GPT [puppet] - 10https://gerrit.wikimedia.org/r/861429 (owner: 10Jbond) [17:19:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] wmflib: update xfs partitions to 4/5 after conversion to GPT [puppet] - 10https://gerrit.wikimedia.org/r/861429 (owner: 10Jbond) [17:19:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323907)', diff saved to https://phabricator.wikimedia.org/P41398 and previous config saved to /var/cache/conftool/dbconfig/20221128-171911-ladsgroup.json [17:19:20] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [17:20:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:20:52] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2050.codfw.wmnet with OS bullseye [17:20:59] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by jbond@cum... [17:21:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:21:25] (03PS1) 10Andrew Bogott: cinder: update volume_type_access.py.patch to resemble upstream patch [puppet] - 10https://gerrit.wikimedia.org/r/861430 (https://phabricator.wikimedia.org/T301949) [17:21:34] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2050.codfw.wmnet with OS bullseye [17:21:43] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond... [17:21:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P41399 and previous config saved to /var/cache/conftool/dbconfig/20221128-172157-marostegui.json [17:22:12] (03PS2) 10Andrew Bogott: trove: remove network_label_regex [puppet] - 10https://gerrit.wikimedia.org/r/858655 (https://phabricator.wikimedia.org/T323319) [17:22:25] (03CR) 10Andrew Bogott: [C: 03+2] cinder: update volume_type_access.py.patch to resemble upstream patch [puppet] - 10https://gerrit.wikimedia.org/r/861430 (https://phabricator.wikimedia.org/T301949) (owner: 10Andrew Bogott) [17:22:33] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10hnowlan) [17:22:47] (03PS2) 10Andrew Bogott: cinder: remove default quota settings [puppet] - 10https://gerrit.wikimedia.org/r/858654 (https://phabricator.wikimedia.org/T323319) [17:22:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:23:40] (03PS1) 10Sohom Datta: Enable limited width on plwikisource MAIN namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) [17:23:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:24:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41400 and previous config saved to /var/cache/conftool/dbconfig/20221128-172419-ladsgroup.json [17:24:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2177.codfw.wmnet with reason: Maintenance [17:24:28] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:24:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2177.codfw.wmnet with reason: Maintenance [17:24:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T323827)', diff saved to https://phabricator.wikimedia.org/P41401 and previous config saved to /var/cache/conftool/dbconfig/20221128-172442-ladsgroup.json [17:26:55] (03CR) 10Sohom Datta: "I'll be free on Nov 30th/Dec 1st during the morning backport, but feel free to deploy before that as well if required 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: 10Sohom Datta) [17:27:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P41402 and previous config saved to /var/cache/conftool/dbconfig/20221128-172707-ladsgroup.json [17:27:59] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:29:40] (03PS1) 10Arturo Borrero Gonzalez: wmcs: libs: openstack: fix host_list regex [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861432 [17:31:35] jouncebot: noandnext [17:31:45] jouncebot: nowandnext [17:31:46] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [17:31:46] In 0 hour(s) and 28 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T1800) [17:32:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T323827)', diff saved to https://phabricator.wikimedia.org/P41403 and previous config saved to /var/cache/conftool/dbconfig/20221128-173206-ladsgroup.json [17:32:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2170.codfw.wmnet with reason: Maintenance [17:32:15] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:32:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2170.codfw.wmnet with reason: Maintenance [17:32:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41404 and previous config saved to /var/cache/conftool/dbconfig/20221128-173227-ladsgroup.json [17:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41405 and previous config saved to /var/cache/conftool/dbconfig/20221128-173418-ladsgroup.json [17:35:26] !log jnuche@deploy1002 Installing scap version "4.29.2" for 558 hosts [17:35:53] !log jnuche@deploy1002 Installation of scap version 
"4.29.2" completed for 558 hosts [17:36:30] 10SRE, 10Wikibase Product Platform, 10Wikimedia-Apache-configuration, 10serviceops: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (10jijiki) @Silvan_WMDE sorry for not replying sooner, I will take a look at this when I find... [17:36:45] 10SRE, 10Wikibase Product Platform, 10Wikimedia-Apache-configuration, 10serviceops: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (10jijiki) a:03jijiki [17:37:02] 10SRE, 10Observability-Alerting: Important nagios-nrpe-server errors not showing up in unit journal - https://phabricator.wikimedia.org/T237236 (10lmata) [17:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P41406 and previous config saved to /var/cache/conftool/dbconfig/20221128-173704-marostegui.json [17:38:20] (03CR) 10David Caro: "Got a question there" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861432 (owner: 10Arturo Borrero Gonzalez) [17:39:46] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [17:40:19] (03PS1) 10Andrew Bogott: nova: don't specify AvailabilityZoneFilter [puppet] - 10https://gerrit.wikimedia.org/r/861433 (https://phabricator.wikimedia.org/T323319) [17:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41407 and previous config saved to /var/cache/conftool/dbconfig/20221128-174213-ladsgroup.json [17:42:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:42:21] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:42:53] (03CR) 10Andrew Bogott: 
[C: 03+2] nova: don't specify AvailabilityZoneFilter [puppet] - 10https://gerrit.wikimedia.org/r/861433 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [17:43:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:43:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:43:11] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2050.codfw.wmnet with reason: host reimage [17:43:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:43:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41408 and previous config saved to /var/cache/conftool/dbconfig/20221128-174324-ladsgroup.json [17:43:26] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41409 and previous config saved to /var/cache/conftool/dbconfig/20221128-174925-ladsgroup.json [17:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41410 and previous config saved to /var/cache/conftool/dbconfig/20221128-174951-ladsgroup.json [17:49:58] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:50:47] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform 
Engineering Reliability): 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10hnowlan) [17:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321126)', diff saved to https://phabricator.wikimedia.org/P41411 and previous config saved to /var/cache/conftool/dbconfig/20221128-175210-marostegui.json [17:52:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2110.codfw.wmnet with reason: Maintenance [17:52:19] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:52:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2110.codfw.wmnet with reason: Maintenance [17:52:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T321126)', diff saved to https://phabricator.wikimedia.org/P41412 and previous config saved to /var/cache/conftool/dbconfig/20221128-175232-marostegui.json [17:54:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321126)', diff saved to https://phabricator.wikimedia.org/P41413 and previous config saved to /var/cache/conftool/dbconfig/20221128-175445-marostegui.json [17:54:57] (03CR) 10Majavah: P:openstack: explicit rules for haproxy backend traffic POC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854875 (owner: 10Majavah) [17:54:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323827)', diff saved to https://phabricator.wikimedia.org/P41414 and previous config saved to /var/cache/conftool/dbconfig/20221128-175458-ladsgroup.json [17:55:04] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [17:55:58] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:39] 
(03CR) 10Arturo Borrero Gonzalez: wmcs: libs: openstack: fix host_list regex (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861432 (owner: 10Arturo Borrero Gonzalez) [17:57:34] (03CR) 10Jdlrobson: Enable shared Reading Lists landing page on all wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [17:59:56] (03PS2) 10Elukey: knative: import new upstream version 1.7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) [18:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T1800). [18:00:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41415 and previous config saved to /var/cache/conftool/dbconfig/20221128-180015-ladsgroup.json [18:00:23] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:00:42] (03CR) 10Klausman: [C: 03+1] knative-serving: improve chart's dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) (owner: 10Elukey) [18:00:49] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2050.codfw.wmnet with OS bullseye [18:00:56] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cum... 
[18:01:22] (03CR) 10Elukey: "Fixed a little issue with build dependency tracking and added two new docker images, related to new daemons that we'll need to run with 1." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [18:01:45] (03PS2) 10Dbrant: Enable shared Reading Lists landing page on all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) [18:03:43] (03CR) 10Dbrant: Enable shared Reading Lists landing page on all wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [18:04:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T323907)', diff saved to https://phabricator.wikimedia.org/P41417 and previous config saved to /var/cache/conftool/dbconfig/20221128-180431-ladsgroup.json [18:04:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [18:04:38] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:04:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance [18:04:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T323907)', diff saved to https://phabricator.wikimedia.org/P41418 and previous config saved to /var/cache/conftool/dbconfig/20221128-180452-ladsgroup.json [18:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P41419 and previous config saved to /var/cache/conftool/dbconfig/20221128-180458-ladsgroup.json [18:05:47] (03CR) 10Jdlrobson: [C: 03+1] Enable shared Reading Lists landing page on all wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [18:05:49] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "thank you both :)" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [18:07:52] (03CR) 10David Caro: wmcs: libs: openstack: fix host_list regex (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861432 (owner: 10Arturo Borrero Gonzalez) [18:08:32] (03CR) 10Andrew Bogott: [C: 03+1] "I approve this for the files that I've authored here. Arturo is likely the author of anything that I'm not." [puppet] - 10https://gerrit.wikimedia.org/r/860903 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:08:51] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed on clouddumps1002, dumpsdata1003" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [18:08:54] (03PS1) 10Arturo Borrero Gonzalez: wmcs: openstack: lib: ensure_canary: fix changelist calculation [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861438 [18:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:09:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack/codfw1dev: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860903 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:09:17] (03CR) 10Andrew Bogott: [C: 03+1] Retire obsolete cloudvirt Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [18:09:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P41420 and previous config saved to /var/cache/conftool/dbconfig/20221128-180951-marostegui.json 
[18:10:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41421 and previous config saved to /var/cache/conftool/dbconfig/20221128-181004-ladsgroup.json [18:13:17] (03CR) 10Jdlrobson: [C: 03+1] "patch looks good! Feel free to backport whenever is convenient!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: 10Sohom Datta) [18:15:15] anyone around who would like to check on a maintenance script for me? https://phabricator.wikimedia.org/T315510#8392683 [18:15:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P41423 and previous config saved to /var/cache/conftool/dbconfig/20221128-181522-ladsgroup.json [18:15:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323907)', diff saved to https://phabricator.wikimedia.org/P41424 and previous config saved to /var/cache/conftool/dbconfig/20221128-181541-ladsgroup.json [18:15:49] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:16:51] MatmaRex: i'll look [18:17:49] (03CR) 10Andrea Denisse: "Hello, the expiry date for the users' access is confirmed." [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: 10Andrea Denisse) [18:18:09] (03CR) 10Andrea Denisse: [C: 03+2] admin: Add missing email for dpujol. 
[puppet] - 10https://gerrit.wikimedia.org/r/860945 (https://phabricator.wikimedia.org/T322670) (owner: 10Andrea Denisse) [18:18:41] MatmaRex: the log file in taavi's home says it started "afwikibooks" and then ends [18:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:19:06] "maintenance job" is normally an actual job running on actual mwmaint. this is a manually started command on the deployment server. [18:19:31] mutante: I realized that about a second after the !log, and moved it to mwmaint1002 [18:19:54] taavi: ah, gotcha! :) [18:19:59] MatmaRex: it's in enwikinews now, says 'Processed 89300 (updated 32726) of 2829596 rows' [18:19:59] mutante: that doesn't seem right, a few days ago folks told me it made it to commonswiki [18:20:02] oh [18:20:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P41425 and previous config saved to /var/cache/conftool/dbconfig/20221128-182004-ladsgroup.json [18:20:14] well, there you go then :) [18:20:22] okay, thanks!
[18:20:26] I did not know the "START" part either [18:20:29] (03PS1) 10Ssingh: cp5002, cp5007: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/861439 (https://phabricator.wikimedia.org/T323830) [18:20:31] (03PS1) 10Ssingh: cp5003, cp5008: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/861440 (https://phabricator.wikimedia.org/T323830) [18:20:33] (03PS1) 10Ssingh: cp5004, cp5009: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/861441 (https://phabricator.wikimedia.org/T323830) [18:20:35] (03PS1) 10Ssingh: cp5005, cp5010: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/861442 (https://phabricator.wikimedia.org/T323830) [18:20:37] (03PS1) 10Ssingh: cp5006: decommission host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/861443 (https://phabricator.wikimedia.org/T323830) [18:23:43] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10hnowlan) [18:24:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P41426 and previous config saved to /var/cache/conftool/dbconfig/20221128-182458-marostegui.json [18:25:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41427 and previous config saved to /var/cache/conftool/dbconfig/20221128-182511-ladsgroup.json [18:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P41428 and previous config saved to /var/cache/conftool/dbconfig/20221128-183028-ladsgroup.json [18:30:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff 
saved to https://phabricator.wikimedia.org/P41429 and previous config saved to /var/cache/conftool/dbconfig/20221128-183048-ladsgroup.json [18:33:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10andrea.denisse) [18:35:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41430 and previous config saved to /var/cache/conftool/dbconfig/20221128-183511-ladsgroup.json [18:35:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2175.codfw.wmnet with reason: Maintenance [18:35:18] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:35:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2175.codfw.wmnet with reason: Maintenance [18:35:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41431 and previous config saved to /var/cache/conftool/dbconfig/20221128-183532-ladsgroup.json [18:35:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:36:13] (03PS4) 10Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) [18:37:03] (03PS4) 10Jbond: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) [18:38:00] (03CR) 10Jbond: "This should be ready to go now. 
i think it would be good to add this node back in and make sure everything works as expected before progr" [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [18:38:02] (03CR) 10CI reject: [V: 04-1] convrt-ssds: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [18:40:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321126)', diff saved to https://phabricator.wikimedia.org/P41432 and previous config saved to /var/cache/conftool/dbconfig/20221128-184004-marostegui.json [18:40:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2119.codfw.wmnet with reason: Maintenance [18:40:11] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:40:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323827)', diff saved to https://phabricator.wikimedia.org/P41433 and previous config saved to /var/cache/conftool/dbconfig/20221128-184017-ladsgroup.json [18:40:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2119.codfw.wmnet with reason: Maintenance [18:40:24] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:40:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T321126)', diff saved to https://phabricator.wikimedia.org/P41434 and previous config saved to /var/cache/conftool/dbconfig/20221128-184025-marostegui.json [18:42:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321126)', diff saved to https://phabricator.wikimedia.org/P41435 and previous config saved to /var/cache/conftool/dbconfig/20221128-184238-marostegui.json [18:42:49] PROBLEM - Host 
db2101.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:43:21] 10SRE, 10LDAP-Access-Requests, 10Security-Team: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10sbassett) [18:43:45] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@276aa70]: relax slas for subgraph and incoming links [18:45:09] 10SRE, 10LDAP-Access-Requests, 10Security-Team: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10sbassett) @KHurd-WMF - Please create a wikitech username and shell account via https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount [18:45:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323827)', diff saved to https://phabricator.wikimedia.org/P41436 and previous config saved to /var/cache/conftool/dbconfig/20221128-184535-ladsgroup.json [18:45:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:45:41] (03PS5) 10Jbond: convrt-ssds: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) [18:45:42] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:45:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:45:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P41437 and previous config saved to /var/cache/conftool/dbconfig/20221128-184554-ladsgroup.json [18:46:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41438 and previous config saved to /var/cache/conftool/dbconfig/20221128-184603-ladsgroup.json [18:46:19] !log ebernhardson@deploy1002 
Finished deploy [wikimedia/discovery/analytics@276aa70]: relax slas for subgraph and incoming links (duration: 02m 34s) [18:48:18] 10SRE, 10SRE-Access-Requests, 10Security-Team: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10sbassett) [18:48:53] RECOVERY - Host db2101.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.58 ms [18:50:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:54:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41439 and previous config saved to /var/cache/conftool/dbconfig/20221128-185420-ladsgroup.json [18:54:28] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:57:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P41440 and previous config saved to /var/cache/conftool/dbconfig/20221128-185745-marostegui.json [19:01:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T323907)', diff saved to https://phabricator.wikimedia.org/P41441 and previous config saved to /var/cache/conftool/dbconfig/20221128-190101-ladsgroup.json [19:01:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [19:01:08] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:01:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [19:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T323907)', 
diff saved to https://phabricator.wikimedia.org/P41442 and previous config saved to /var/cache/conftool/dbconfig/20221128-190122-ladsgroup.json [19:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41443 and previous config saved to /var/cache/conftool/dbconfig/20221128-190122-ladsgroup.json [19:01:35] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [19:01:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:06:22] (03PS1) 10Ssingh: P:cache::haproxy: harden systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) [19:07:32] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38459/console" [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [19:09:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P41444 and previous config saved to /var/cache/conftool/dbconfig/20221128-190927-ladsgroup.json [19:11:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:12:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T323907)', diff saved to https://phabricator.wikimedia.org/P41445 and previous config saved to 
/var/cache/conftool/dbconfig/20221128-191211-ladsgroup.json [19:12:18] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:12:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P41446 and previous config saved to /var/cache/conftool/dbconfig/20221128-191251-marostegui.json [19:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P41447 and previous config saved to /var/cache/conftool/dbconfig/20221128-191629-ladsgroup.json [19:17:34] (03PS1) 10Ottomata: beta - set message_key_fields on stream rc0.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861446 (https://phabricator.wikimedia.org/T318846) [19:18:59] (03CR) 10Ottomata: [C: 03+2] beta - set message_key_fields on stream rc0.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861446 (https://phabricator.wikimedia.org/T318846) (owner: 10Ottomata) [19:23:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:24:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P41448 and previous config saved to /var/cache/conftool/dbconfig/20221128-192433-ladsgroup.json [19:24:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:24:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:25:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:25:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5002,5007].eqsin.wmnet [19:27:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P41449 and previous config saved to 
/var/cache/conftool/dbconfig/20221128-192718-ladsgroup.json [19:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321126)', diff saved to https://phabricator.wikimedia.org/P41450 and previous config saved to /var/cache/conftool/dbconfig/20221128-192758-marostegui.json [19:28:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2136.codfw.wmnet with reason: Maintenance [19:28:04] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:28:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2136.codfw.wmnet with reason: Maintenance [19:28:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T321126)', diff saved to https://phabricator.wikimedia.org/P41451 and previous config saved to /var/cache/conftool/dbconfig/20221128-192830-marostegui.json [19:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321126)', diff saved to https://phabricator.wikimedia.org/P41452 and previous config saved to /var/cache/conftool/dbconfig/20221128-193043-marostegui.json [19:31:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P41453 and previous config saved to /var/cache/conftool/dbconfig/20221128-193135-ladsgroup.json [19:31:47] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [19:31:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:34:20] (03CR) 10Bking: [C: 03+2] mjolnir msearch: Reduce allowed concurrency [puppet] - 10https://gerrit.wikimedia.org/r/860129 (https://phabricator.wikimedia.org/T318575) 
(owner: Ebernhardson) [19:35:06] herron: hi [19:35:15] I have some pending thanos changes in the dns cookbook [19:35:26] cwhite: or herron ^ [19:35:42] I think those were godog from earlier let me find the task [19:36:19] ok! [19:37:05] sukhe: if it's for the svc zonefile then merge it away [19:37:07] it's a noop in prod [19:37:23] ok thanks volans! [19:37:25] (PS2) Dzahn: phabricator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860905 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [19:37:38] this diff feature is nice [19:37:41] very much appreciated [19:38:13] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5002,5007].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [19:39:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T323827)', diff saved to https://phabricator.wikimedia.org/P41454 and previous config saved to /var/cache/conftool/dbconfig/20221128-193940-ladsgroup.json [19:39:47] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [19:41:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5002,5007].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [19:41:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:41:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[5002,5007].eqsin.wmnet [19:41:35] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp[5002,5007].eqsin.wmnet` - 
cp5002.eqsin.w... [19:41:54] (CR) Ssingh: [C: +2] cp5002, cp5007: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861439 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [19:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:42:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P41455 and previous config saved to /var/cache/conftool/dbconfig/20221128-194224-ladsgroup.json [19:44:04] (PS1) Ottomata: rc0.mediawiki.page_change stream - produce with keyed message [mediawiki-config] - https://gerrit.wikimedia.org/r/861448 (https://phabricator.wikimedia.org/T318846) [19:44:13] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh) [19:45:02] (PS2) Ottomata: rc0.mediawiki.page_change stream - produce with keyed message [mediawiki-config] - https://gerrit.wikimedia.org/r/861448 (https://phabricator.wikimedia.org/T318846) [19:45:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P41456 and previous config saved to /var/cache/conftool/dbconfig/20221128-194551-marostegui.json [19:46:10] (CR) Ottomata: [C: +2] rc0.mediawiki.page_change stream - produce with keyed message [mediawiki-config] - https://gerrit.wikimedia.org/r/861448 (https://phabricator.wikimedia.org/T318846) (owner: Ottomata) [19:46:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323827)', diff saved to https://phabricator.wikimedia.org/P41457 and previous 
config saved to /var/cache/conftool/dbconfig/20221128-194642-ladsgroup.json [19:46:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:46:50] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [19:46:55] (Merged) jenkins-bot: rc0.mediawiki.page_change stream - produce with keyed message [mediawiki-config] - https://gerrit.wikimedia.org/r/861448 (https://phabricator.wikimedia.org/T318846) (owner: Ottomata) [19:46:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:47:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T323827)', diff saved to https://phabricator.wikimedia.org/P41458 and previous config saved to /var/cache/conftool/dbconfig/20221128-194703-ladsgroup.json [19:47:07] PROBLEM - Confd vcl based reload on cp5016 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:47:15] PROBLEM - Confd vcl based reload on cp5008 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:47:17] PROBLEM - Confd vcl based reload on cp5009 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:47:36] interesting [19:47:37] PROBLEM - Confd vcl based reload on cp5014 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:47:54] PROBLEM - Confd vcl based reload on cp5004 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:48:07] PROBLEM - Confd vcl based reload on cp5015 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [19:48:15] looking [19:48:29] PROBLEM - Confd vcl based reload on cp5003 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:48:31] PROBLEM - Confd vcl based reload on cp5013 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:48:41] PROBLEM - Confd vcl based reload on cp5011 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:48:45] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:48:53] PROBLEM - Confd vcl based reload on cp5006 is CRITICAL: reload-vcl failed to run since 0h, 5 minutes. https://wikitech.wikimedia.org/wiki/Varnish [19:49:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST virtualservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:50:06] bblack: ^ [19:50:21] the confd vcl based reload thing is back, probably stemming from the depool of cp5002 and 5007! [19:50:33] A:cp-eqsin echo OK? 
:) [19:50:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:51:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:53:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:54:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:57:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T323907)', diff saved to https://phabricator.wikimedia.org/P41459 and previous config saved to /var/cache/conftool/dbconfig/20221128-195731-ladsgroup.json [19:57:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [19:57:38] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:57:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [19:57:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T323907)', diff saved to https://phabricator.wikimedia.org/P41460 and previous config saved to /var/cache/conftool/dbconfig/20221128-195753-ladsgroup.json [20:00:57] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp5028.eqsin.wmnet,service=ats-be [20:00:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P41461 and previous config saved to /var/cache/conftool/dbconfig/20221128-200058-marostegui.json [20:01:09] RECOVERY - Confd vcl based reload on cp5003 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [20:01:20] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp5028.eqsin.wmnet,service=ats-be [20:04:59] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:04:59] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5020.eqsin.wmnet,service=ats-be [20:05:10] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5020.eqsin.wmnet,service=ats-be [20:05:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T323827)', diff saved to https://phabricator.wikimedia.org/P41462 and previous config saved to /var/cache/conftool/dbconfig/20221128-200522-ladsgroup.json [20:05:29] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:08:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T323907)', diff saved to https://phabricator.wikimedia.org/P41463 and previous config saved to /var/cache/conftool/dbconfig/20221128-200838-ladsgroup.json [20:08:45] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [20:10:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:11] RECOVERY - Confd vcl based reload on cp5004 is OK: reload-vcl successfully ran 0h, 9 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [20:13:27] RECOVERY - Confd vcl based reload on cp5008 is OK: reload-vcl successfully ran 0h, 7 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [20:13:42] (PS1) Ottomata: eventgate - bump version to get keyed message support [deployment-charts] - https://gerrit.wikimedia.org/r/861451 (https://phabricator.wikimedia.org/T318846) [20:14:07] (CR) Ottomata: [C: +2] eventgate - bump version to get keyed message support [deployment-charts] - https://gerrit.wikimedia.org/r/861451 (https://phabricator.wikimedia.org/T318846) (owner: Ottomata) [20:14:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:15:48] (PS1) Kghbln: Add ProWiki feed [puppet] - https://gerrit.wikimedia.org/r/861452 [20:16:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321126)', diff saved to https://phabricator.wikimedia.org/P41464 and previous config saved to /var/cache/conftool/dbconfig/20221128-201604-marostegui.json [20:16:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance [20:16:13] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [20:16:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance [20:16:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41465 and previous config saved to /var/cache/conftool/dbconfig/20221128-201636-marostegui.json [20:18:19] RECOVERY - Confd vcl based reload on cp5016 is OK: reload-vcl successfully ran 0h, 12 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [20:18:32] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [20:18:34] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [20:18:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41466 and previous config saved to /var/cache/conftool/dbconfig/20221128-201849-marostegui.json [20:19:16] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [20:20:10] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [20:20:16] (CR) Kghbln: "Hi Daniel, according to https://www.mediawiki.org/wiki/Git/Reviewers#operations/puppet you are reviewing the planet. Will be great to get " [puppet] - https://gerrit.wikimedia.org/r/861452 (owner: Kghbln) [20:20:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P41467 and previous config saved to /var/cache/conftool/dbconfig/20221128-202029-ladsgroup.json [20:21:16] (PS2) Dzahn: planet: Add ProWiki feed [puppet] - https://gerrit.wikimedia.org/r/861452 (owner: Kghbln) [20:21:29] PROBLEM - Confd vcl based reload on cp5012 is CRITICAL: reload-vcl failed to run since 0h, 16 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [20:21:59] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [20:22:56] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [20:23:05] SRE, Data Pipelines, Data-Engineering-Planning, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (EChetty) [20:23:37] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P41468 and previous config saved to /var/cache/conftool/dbconfig/20221128-202345-ladsgroup.json [20:24:14] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [20:24:19] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [20:24:26] (CR) Dzahn: [C: +2] "Sure, no problem. per https://wikiindex.org/Jeroen_De_Dauw" [puppet] - https://gerrit.wikimedia.org/r/861452 (owner: Kghbln) [20:25:14] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [20:25:21] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [20:26:13] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [20:26:33] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [20:27:04] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [20:27:13] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [20:27:23] (CR) Dzahn: [C: +1] "lgtm. there might be a few more minor authors. 
not sure where we draw the line sometimes" [puppet] - https://gerrit.wikimedia.org/r/860905 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [20:28:03] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [20:28:11] (CR) Dzahn: [C: +2] Enable profile::auto_restarts::service for Envoy on planet [puppet] - https://gerrit.wikimedia.org/r/860560 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [20:28:16] (CR) Kghbln: planet: Add ProWiki feed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861452 (owner: Kghbln) [20:28:27] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [20:29:12] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [20:29:49] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [20:30:16] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [20:30:29] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [20:31:00] (CR) Dzahn: [C: +2] planet: Add ProWiki feed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861452 (owner: Kghbln) [20:31:16] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [20:31:31] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [20:32:15] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [20:32:50] (CR) Dzahn: [C: +2] "service and timer was created on planet1002. I also tested manually starting it." 
[puppet] - https://gerrit.wikimedia.org/r/860560 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [20:33:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P41469 and previous config saved to /var/cache/conftool/dbconfig/20221128-203356-marostegui.json [20:35:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P41470 and previous config saved to /var/cache/conftool/dbconfig/20221128-203535-ladsgroup.json [20:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P41471 and previous config saved to /var/cache/conftool/dbconfig/20221128-203851-ladsgroup.json [20:42:16] (PS1) Ottomata: Revert portals to commit 2177e33bdb9db87b01be886161419d604134e0b6 [mediawiki-config] - https://gerrit.wikimedia.org/r/861455 [20:43:41] (CR) Ottomata: [C: +2] Revert portals to commit 2177e33bdb9db87b01be886161419d604134e0b6 [mediawiki-config] - https://gerrit.wikimedia.org/r/861455 (owner: Ottomata) [20:44:38] (CR) Kghbln: planet: Add ProWiki feed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861452 (owner: Kghbln) [20:44:43] (Merged) jenkins-bot: Revert portals to commit 2177e33bdb9db87b01be886161419d604134e0b6 [mediawiki-config] - https://gerrit.wikimedia.org/r/861455 (owner: Ottomata) [20:48:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:49:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P41472 and previous config saved to /var/cache/conftool/dbconfig/20221128-204902-marostegui.json [20:50:23] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:50:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T323827)', diff saved to https://phabricator.wikimedia.org/P41473 and previous config saved to /var/cache/conftool/dbconfig/20221128-205041-ladsgroup.json [20:50:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1188.eqiad.wmnet with reason: Maintenance [20:50:47] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:50:48] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [20:50:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1188.eqiad.wmnet with reason: Maintenance [20:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T323827)', diff saved to https://phabricator.wikimedia.org/P41474 and previous config saved to /var/cache/conftool/dbconfig/20221128-205103-ladsgroup.json [20:51:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:51:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:52:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:53:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T323907)', diff saved to https://phabricator.wikimedia.org/P41475 and previous config saved to /var/cache/conftool/dbconfig/20221128-205358-ladsgroup.json [20:54:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:54:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:54:04] T323907: Make fr_user unsigned - 
https://phabricator.wikimedia.org/T323907 [20:55:56] (CR) Andrea Denisse: "Hi, I've implemented your suggestions and the PCC results look good to me: https://puppet-compiler.wmflabs.org/output/854951/38447/" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [20:59:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5003.eqsin.wmnet,service=ats-tls [20:59:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5003.eqsin.wmnet,service=ats-be [20:59:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5003.eqsin.wmnet,service=varnish-fe [21:00:05] RoanKattouw, Urbanecm, cjming, and kindrobot: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T2100). [21:00:05] dbrant: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:07] RECOVERY - Confd vcl based reload on cp5006 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [21:01:03] RECOVERY - Confd vcl based reload on cp5014 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [21:01:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5003.eqsin.wmnet with reason: downtimed, to be depooled [21:01:31] RECOVERY - Confd vcl based reload on cp5013 is OK: reload-vcl successfully ran 0h, 1 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [21:01:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5003.eqsin.wmnet with reason: downtimed, to be depooled [21:01:51] (PS1) BBlack: p::phabricator::main: remove unused $cache_nodes [puppet] - https://gerrit.wikimedia.org/r/861460 [21:02:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5008.eqsin.wmnet,service=ats-tls [21:02:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5008.eqsin.wmnet,service=ats-be [21:02:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5008.eqsin.wmnet,service=varnish-fe [21:02:27] RECOVERY - Confd vcl based reload on cp5015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [21:02:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5008.eqsin.wmnet with reason: downtimed, to be depooled [21:02:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5008.eqsin.wmnet with reason: downtimed, to be depooled [21:02:45] (CR) Ssingh: [C: +2] cp5003, cp5008: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861440 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [21:03:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [21:03:09] RECOVERY - Confd vcl based reload on cp5012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [21:03:36] (PS2) Ssingh: cp5003, cp5008: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861440 (https://phabricator.wikimedia.org/T323830) [21:04:07] RECOVERY - Confd vcl based reload on cp5011 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [21:04:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41476 and previous config saved to /var/cache/conftool/dbconfig/20221128-210408-marostegui.json [21:04:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [21:04:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [21:04:15] RECOVERY - Confd vcl based reload on cp5009 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [21:04:15] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [21:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41477 and previous config saved to /var/cache/conftool/dbconfig/20221128-210419-marostegui.json [21:04:59] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:09] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:06:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host arclamp1001.eqiad.wmnet with OS bullseye [21:06:12] o/ deployers around? 
[21:06:12] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye [21:06:30] I can deploy [21:06:32] (CR) Dzahn: [C: +1] "aha, lgtm, I don't see anything using it. compiler shows it's just removing the parameter and values though: https://puppet-compiler.wmfla" [puppet] - https://gerrit.wikimedia.org/r/861460 (owner: BBlack) [21:06:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41478 and previous config saved to /var/cache/conftool/dbconfig/20221128-210632-marostegui.json [21:06:57] (PS3) Clare Ming: Enable shared Reading Lists landing page on all wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) (owner: Dbrant) [21:08:29] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) (owner: Dbrant) [21:09:04] (CR) Dzahn: [C: +2] "Oh, it's a great change, thanks!:) The topic branch was just a little detail for me but it's nice to have them grouped. The "planet: " pre" [puppet] - https://gerrit.wikimedia.org/r/861452 (owner: Kghbln) [21:09:13] (Merged) jenkins-bot: Enable shared Reading Lists landing page on all wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/861397 (https://phabricator.wikimedia.org/T313269) (owner: Dbrant) [21:09:27] !log cjming@deploy1002 Started scap: Backport for [[gerrit:861397|Enable shared Reading Lists landing page on all wikis. 
(T313269)]] [21:09:33] T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269 [21:10:26] (PS2) BBlack: p::phabricator::main: remove unused $cache_nodes [puppet] - https://gerrit.wikimedia.org/r/861460 (https://phabricator.wikimedia.org/T270185) [21:10:27] !log cjming@deploy1002 cjming and dbrant: Backport for [[gerrit:861397|Enable shared Reading Lists landing page on all wikis. (T313269)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:10:50] dbrant: up on test servers if you'd like to verify [21:11:17] cjming: confirmed! looks good [21:11:38] cool - syncing [21:11:59] (CR) BBlack: p::phabricator::main: remove unused $cache_nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861460 (https://phabricator.wikimedia.org/T270185) (owner: BBlack) [21:12:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5003,5008].eqsin.wmnet [21:12:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:13:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:13:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:13:48] (CR) BBlack: [C: +2] p::phabricator::main: remove unused $cache_nodes [puppet] - https://gerrit.wikimedia.org/r/861460 (https://phabricator.wikimedia.org/T270185) (owner: BBlack) [21:14:44] (PS1) MSantos: wikifeeds: bump to 2022-11-28-160349-production [deployment-charts] - https://gerrit.wikimedia.org/r/861461 [21:14:59] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:15:49] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:861397|Enable shared Reading Lists landing page on all wikis. 
(T313269)]] (duration: 06m 22s) [21:15:56] T313269: Shareable Reading Lists - https://phabricator.wikimedia.org/T313269 [21:15:59] dbrant: live! [21:16:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:16:23] cjming: excellent, many thanks! [21:16:45] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:45] so welcome! [21:17:15] I'll hang out for a bit longer before closing the backport window [21:18:31] (PS11) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [21:18:33] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [21:19:31] PROBLEM - Check systemd state on kubernetes2011 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:17] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:20:49] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5003,5008].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [21:21:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P41479 and previous config saved to /var/cache/conftool/dbconfig/20221128-212138-marostegui.json [21:23:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5003,5008].eqsin.wmnet 
decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [21:23:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:23:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[5003,5008].eqsin.wmnet [21:23:14] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp[5003,5008].eqsin.wmnet` - cp5003.eqsin.w... [21:25:12] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh) [21:27:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T323827)', diff saved to https://phabricator.wikimedia.org/P41480 and previous config saved to /var/cache/conftool/dbconfig/20221128-212702-ladsgroup.json [21:27:23] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [21:27:39] (PS1) BBlack: docker_registry_ha: remove unused cache::nodes ref [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) [21:29:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:33] !log end of UTC late backport window [21:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P41481 and previous config saved to /var/cache/conftool/dbconfig/20221128-213645-marostegui.json [21:39:59] (KubernetesAPILatency) firing: (2) High Kubernetes API 
latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:42:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P41482 and previous config saved to /var/cache/conftool/dbconfig/20221128-214208-ladsgroup.json [21:44:32] RECOVERY - Check systemd state on kubernetes2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5004.eqsin.wmnet,service=ats-tls [21:44:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5004.eqsin.wmnet,service=ats-be [21:44:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5004.eqsin.wmnet,service=varnish-fe [21:44:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5009.eqsin.wmnet,service=ats-tls [21:44:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5009.eqsin.wmnet,service=ats-be [21:44:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5009.eqsin.wmnet,service=varnish-fe [21:46:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp[5004,5009].eqsin.wmnet with reason: downtimed, to be depooled [21:46:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp[5004,5009].eqsin.wmnet with reason: downtimed, to be depooled [21:47:49] (PS2) Ssingh: cp5004, cp5009: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861441 (https://phabricator.wikimedia.org/T323830) [21:48:38] (CR) Ssingh: [C: +2] cp5004, cp5009: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861441 
(https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [21:50:52] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:51:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321126)', diff saved to https://phabricator.wikimedia.org/P41483 and previous config saved to /var/cache/conftool/dbconfig/20221128-215151-marostegui.json [21:51:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:51:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:51:59] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [21:52:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2147.codfw.wmnet with reason: Maintenance [21:52:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2147.codfw.wmnet with reason: Maintenance [21:52:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T321126)', diff saved to https://phabricator.wikimedia.org/P41484 and previous config saved to /var/cache/conftool/dbconfig/20221128-215223-marostegui.json [21:54:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321126)', diff saved to https://phabricator.wikimedia.org/P41485 and previous config saved to /var/cache/conftool/dbconfig/20221128-215435-marostegui.json [21:55:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5004,5009].eqsin.wmnet [21:57:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P41486 and previous config 
saved to /var/cache/conftool/dbconfig/20221128-215715-ladsgroup.json [21:59:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST virtualservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:00:01] !log phabricator: phab1001 -> phab1004 migration starting soon; downtime expected (T280597) [22:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T2200) [22:00:05] mutante and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator migration to phab1004 . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221128T2200). [22:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:08] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [22:00:40] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on phab1001.eqiad.wmnet with reason: T322250 [22:00:46] T322250: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250 [22:00:56] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on phab1001.eqiad.wmnet with reason: T322250 [22:03:28] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [22:05:07] (PS2) Ssingh: cp5005, cp5010: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861442 (https://phabricator.wikimedia.org/T323830) [22:06:03] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5004,5009].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [22:07:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5004,5009].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [22:07:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:07:19] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts cp[5004,5009].eqsin.wmnet [22:08:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host arclamp1001.eqiad.wmnet with OS bullseye [22:09:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-220944-marostegui.json [22:11:48] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T323827)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-221221-ladsgroup.json [22:12:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1197.eqiad.wmnet with reason: Maintenance [22:12:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1197.eqiad.wmnet with reason: Maintenance [22:12:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T323827)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-221242-ladsgroup.json [22:15:43] (PS1) JHathaway: postfix::mx: vrts password [labs/private] - https://gerrit.wikimedia.org/r/861487 [22:18:38] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:52] (PS12) Andrew Bogott: Add cookbook to restart openstack services 
[cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 [22:21:39] (CR) JHathaway: [C: +2] postfix::mx: vrts password [labs/private] - https://gerrit.wikimedia.org/r/861487 (owner: JHathaway) [22:21:41] (CR) JHathaway: [V: +2 C: +2] postfix::mx: vrts password [labs/private] - https://gerrit.wikimedia.org/r/861487 (owner: JHathaway) [22:23:54] (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott) [22:24:11] (CR) Brennen Bearnes: [C: +1] "Discussed during migration window." [puppet] - https://gerrit.wikimedia.org/r/859145 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:24:25] (CR) Dzahn: [C: +2] phabricator: set mysql master port for eqiad [puppet] - https://gerrit.wikimedia.org/r/859145 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:24:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-222450-marostegui.json [22:25:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5005.eqsin.wmnet,service=ats-tls [22:25:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5005.eqsin.wmnet,service=ats-be [22:25:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5005.eqsin.wmnet,service=varnish-fe [22:25:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5010.eqsin.wmnet,service=ats-tls [22:25:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5010.eqsin.wmnet,service=ats-be [22:25:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5010.eqsin.wmnet,service=varnish-fe [22:26:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp[5005,5010].eqsin.wmnet with reason: downtimed, 
to be depooled [22:26:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp[5005,5010].eqsin.wmnet with reason: downtimed, to be depooled [22:26:29] (CR) Ssingh: [C: +2] cp5005, cp5010: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861442 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [22:26:57] jhathaway: merging your labs/private change! [22:27:08] sukhe: thanks [22:27:08] er no, not labs/private but [22:27:11] ok [22:30:41] (CR) Brennen Bearnes: [C: +1] Revert "Revert "hieradata: switch active Phabricator server to phab1004"" [puppet] - https://gerrit.wikimedia.org/r/860031 (owner: Dzahn) [22:31:22] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/output/860031/38461/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/860031 (owner: Dzahn) [22:31:28] (PS2) Dzahn: Revert "Revert "hieradata: switch active Phabricator server to phab1004"" [puppet] - https://gerrit.wikimedia.org/r/860031 [22:32:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5005,5010].eqsin.wmnet [22:36:28] (PS2) Ssingh: cp5006: decommission host (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861443 (https://phabricator.wikimedia.org/T323830) [22:37:43] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [22:39:32] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5005,5010].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [22:39:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321126)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-223956-marostegui.json [22:39:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:39:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2155.codfw.wmnet with reason: Maintenance [22:40:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2155.codfw.wmnet with reason: Maintenance [22:40:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:40:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [22:40:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T321126)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-224022-marostegui.json [22:41:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5005,5010].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [22:41:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:41:20] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts cp[5005,5010].eqsin.wmnet [22:42:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet,service=ats-tls [22:42:01] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet,service=ats-be [22:42:01] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet,service=varnish-fe [22:42:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5006.eqsin.wmnet with reason: downtimed, to be depooled [22:42:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321126)', diff saved to and 
previous config saved to /var/cache/conftool/dbconfig/20221128-224235-marostegui.json [22:42:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5006.eqsin.wmnet with reason: downtimed, to be depooled [22:42:51] (CR) Ssingh: [C: +2] cp5006: decommission host (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/861443 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [22:44:19] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp5006.eqsin.wmnet [22:50:07] (PS13) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 [22:50:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [22:50:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [22:51:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T323907)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-225101-ladsgroup.json [22:52:08] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [22:53:17] !log brennen@deploy1002 Started deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1001 -> phab1004 (T280597) [22:53:28] (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott) [22:54:10] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1001 -> phab1004 (T280597) (duration: 00m 52s) [22:54:46] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: cp5006.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [22:54:53] (CR) Brennen Bearnes: [C: +1] phabricator: let phd run on phab1004 [puppet] - https://gerrit.wikimedia.org/r/859628 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:55:01] (CR) Dzahn: [C: +2] phabricator: let phd run on phab1004 [puppet] - https://gerrit.wikimedia.org/r/859628 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:56:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5006.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [22:56:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:56:34] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts cp5006.eqsin.wmnet [22:57:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:57:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221128-225741-marostegui.json [22:58:09] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:46] (PS2) Dzahn: Revert "Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF"" [dns] - https://gerrit.wikimedia.org/r/860032 [22:59:47] (PS14) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - 
https://gerrit.wikimedia.org/r/837751 [22:59:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:00:14] (CR) Brennen Bearnes: [C: +1] Revert "Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF"" [dns] - https://gerrit.wikimedia.org/r/860032 (owner: Dzahn) [23:00:17] (CR) Dzahn: [C: +2] Revert "Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF"" [dns] - https://gerrit.wikimedia.org/r/860032 (owner: Dzahn) [23:02:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:24] (CR) CI reject: [V: -1] Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 (owner: Andrew Bogott) [23:05:10] (PS15) Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837751 [23:09:52] (PS1) Dzahn: phabricator: quote mysql port numbers [puppet] - https://gerrit.wikimedia.org/r/861489 (https://phabricator.wikimedia.org/T280597) [23:09:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:11:44] (PS2) Dzahn: phabricator: quote mysql port numbers [puppet] - https://gerrit.wikimedia.org/r/861489 (https://phabricator.wikimedia.org/T280597) [23:12:12] (CR) Brennen Bearnes: [C: +1] phabricator: quote mysql port numbers [puppet] - https://gerrit.wikimedia.org/r/861489 
(https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:12:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:12:23] (CR) Dzahn: [C: +2] phabricator: quote mysql port numbers [puppet] - https://gerrit.wikimedia.org/r/861489 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:12:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:12:43] (CR) Dzahn: [V: +2 C: +2] phabricator: quote mysql port numbers [puppet] - https://gerrit.wikimedia.org/r/861489 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:12:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P41487 and previous config saved to /var/cache/conftool/dbconfig/20221128-231247-marostegui.json [23:12:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P41488 and previous config saved to /var/cache/conftool/dbconfig/20221128-231258-ladsgroup.json [23:13:05] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [23:14:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [23:14:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [23:14:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:14:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:14:26] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P41489 and previous config saved to /var/cache/conftool/dbconfig/20221128-231426-ladsgroup.json [23:14:35] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:15:21] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1062 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:15:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [23:15:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [23:15:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41490 and previous config saved to /var/cache/conftool/dbconfig/20221128-231548-ladsgroup.json [23:15:55] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [23:16:17] (PS1) Dzahn: phabricator: change db ports to strings in tools class [puppet] - https://gerrit.wikimedia.org/r/861490 (https://phabricator.wikimedia.org/T280597) [23:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323907)', diff saved to https://phabricator.wikimedia.org/P41491 and previous config saved to /var/cache/conftool/dbconfig/20221128-231623-ladsgroup.json [23:16:46] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T323960 (phaultfinder) [23:16:58] (CR) Dzahn: [C: +2] phabricator: change db ports to strings in tools class [puppet] - https://gerrit.wikimedia.org/r/861490 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:17:33] (CR) Dzahn: [V: +2 C: +2] 
phabricator: change db ports to strings in tools class [puppet] - https://gerrit.wikimedia.org/r/861490 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:17:51] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (phaultfinder) [23:18:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P41492 and previous config saved to /var/cache/conftool/dbconfig/20221128-231821-ladsgroup.json [23:19:59] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:20:12] (PS1) Dzahn: phabricator: switch mysql slave port for logmail to string [puppet] - https://gerrit.wikimedia.org/r/861491 (https://phabricator.wikimedia.org/T280597) [23:20:45] (CR) Dzahn: [V: +2 C: +2] phabricator: switch mysql slave port for logmail to string [puppet] - https://gerrit.wikimedia.org/r/861491 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:22:26] !log brennen@deploy1002 Started deploy [phabricator/deployment@f68dc24]: deploy config changes for mysql-port-as-string (T280597) [23:22:33] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [23:23:18] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh) [23:23:21] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy config changes for mysql-port-as-string (T280597) (duration: 00m 55s) [23:24:11] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2155 (T321126)', diff saved to https://phabricator.wikimedia.org/P41493 and previous config saved to /var/cache/conftool/dbconfig/20221128-232754-marostegui.json [23:27:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:28:02] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [23:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P41494 and previous config saved to /var/cache/conftool/dbconfig/20221128-232805-ladsgroup.json [23:28:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:28:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T321126)', diff saved to https://phabricator.wikimedia.org/P41495 and previous config saved to /var/cache/conftool/dbconfig/20221128-232815-marostegui.json [23:30:09] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/859631 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:30:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321126)', diff saved to https://phabricator.wikimedia.org/P41496 and previous config saved to /var/cache/conftool/dbconfig/20221128-233028-marostegui.json [23:30:40] (CR) CI reject: [V: -1] phabricator: move some more settings from host file to common [puppet] - https://gerrit.wikimedia.org/r/859631 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:31:07] (PS2) Dzahn: phabricator: move some more settings from host file to common [puppet] - https://gerrit.wikimedia.org/r/859631 (https://phabricator.wikimedia.org/T280597) [23:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to 
https://phabricator.wikimedia.org/P41497 and previous config saved to /var/cache/conftool/dbconfig/20221128-233130-ladsgroup.json [23:32:10] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@d361052]: msearch_daemon: Remove cluster selection/load monitor [23:32:53] (PS3) Dzahn: mariadb: remove phab1001 from production-m3 grants [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) [23:32:58] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn) [23:33:02] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@d361052]: msearch_daemon: Remove cluster selection/load monitor (duration: 00m 51s) [23:33:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P41498 and previous config saved to /var/cache/conftool/dbconfig/20221128-233328-ladsgroup.json [23:33:48] (CR) BBlack: "PCC says no system changes, just expected unused parameter data removal:" [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: BBlack) [23:41:40] (PS1) RLazarus: httpbb: Replace URL for metawiki test [puppet] - https://gerrit.wikimedia.org/r/861497 (https://phabricator.wikimedia.org/T323707) [23:41:54] (PS1) Dzahn: phabricator: disable phd running on phab1001 [puppet] - https://gerrit.wikimedia.org/r/861498 (https://phabricator.wikimedia.org/T323418) [23:42:22] (CR) Brennen Bearnes: [C: +1] phabricator: disable phd running on phab1001 [puppet] - https://gerrit.wikimedia.org/r/861498 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn) [23:42:32] (CR) Dzahn: [V: +2 C: +2] phabricator: disable phd running on phab1001 [puppet] - https://gerrit.wikimedia.org/r/861498 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn) [23:43:11] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P41499 and previous config saved to /var/cache/conftool/dbconfig/20221128-234311-ladsgroup.json [23:45:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P41500 and previous config saved to /var/cache/conftool/dbconfig/20221128-234535-marostegui.json [23:46:21] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1062 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:46:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P41501 and previous config saved to /var/cache/conftool/dbconfig/20221128-234636-ladsgroup.json [23:48:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P41502 and previous config saved to /var/cache/conftool/dbconfig/20221128-234834-ladsgroup.json [23:52:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41503 and previous config saved to /var/cache/conftool/dbconfig/20221128-235223-ladsgroup.json [23:52:30] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [23:58:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P41504 and previous config saved to /var/cache/conftool/dbconfig/20221128-235817-ladsgroup.json [23:58:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [23:58:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: 
Maintenance [23:58:25] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827