[00:16:23] (CR) Tim Starling: [C: +1] "Yikes. Approved for deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/874419 (https://phabricator.wikimedia.org/T253547) (owner: Krinkle) [00:28:07] (CR) Dreamy Jazz: [C: +1] "LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: Zabe) [00:46:42] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:09:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:12:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:15:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.266 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:15:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:15:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:20:48] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:33:41] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:37:46] (JobUnavailable) firing: (4) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:46] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:41] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:57:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:16] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [02:21:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:22:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:30] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:07:43] (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.17 [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/873923 (https://phabricator.wikimedia.org/T325580) [03:07:47] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.17 [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/873923 (https://phabricator.wikimedia.org/T325580) (owner: TrainBranchBot) [03:23:11] (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.17 [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/873923 (https://phabricator.wikimedia.org/T325580) (owner:
TrainBranchBot) [03:35:22] SRE, Observability-Alerting, Incident Tooling, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) >>! In T325856#8488229, @MatthewVernon wrote: > [mostly a note for future reference] Fri 23rd is a working day in Europe, so i... [03:36:28] SRE, Observability-Alerting, Incident Tooling, SRE Observability (FY2022/2023-Q3): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) [03:41:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:41:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:43:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:43:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:44:46] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 400 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:45:42] SRE, Observability-Alerting, Incident Tooling, SRE Observability (FY2022/2023-Q3): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) Changes have been reverted to the business hours configuration in VO (see attached) {F35959108} [03:46:20] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:47:14] SRE, Observability-Alerting, Incident Tooling, SRE Observability (FY2022/2023-Q3): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) Stalled→Resolved Resolving, please reopen if there is an issue with this change. [04:01:19] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/874446 (https://phabricator.wikimedia.org/T325580) [04:01:21] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/874446 (https://phabricator.wikimedia.org/T325580) (owner: TrainBranchBot) [04:01:58] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/874446 (https://phabricator.wikimedia.org/T325580) (owner: TrainBranchBot) [04:02:28] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.17 refs T325580 [04:02:32] T325580: 1.40.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T325580 [04:17:24] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:28:00] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok.
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:58:00] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.17 refs T325580 (duration: 55m 31s) [04:58:04] T325580: 1.40.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T325580 [05:01:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:03:06] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:28:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:34:50] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:06:34] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:21:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:22:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:59:26] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:26:40] (PS1) Muehlenhoff: Add Eddie Greiner-Petter to CONTRIBUTORS [puppet] - https://gerrit.wikimedia.org/r/874775 [07:28:40] (CR) Muehlenhoff: [C: +2] Add Eddie Greiner-Petter to CONTRIBUTORS [puppet] - https://gerrit.wikimedia.org/r/874775 (owner: Muehlenhoff) [07:31:06] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:34:09] (PS4) Phedenskog: Turn off wgNavigationTimingOversampleFactor campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) (owner: Krinkle) [07:34:28] (CR) Phedenskog: [C: +1] "Great, lets get rid off those."
[mediawiki-config] - https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) (owner: Krinkle) [07:44:40] PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:46:36] (PS3) Muehlenhoff: Add a license statement to puppet.git with an overview [puppet] - https://gerrit.wikimedia.org/r/870813 [07:46:47] (CR) Muehlenhoff: Add a license statement to puppet.git with an overview (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff) [07:49:38] (CR) Muehlenhoff: cassandra: Remove support for stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870847 (owner: Muehlenhoff) [07:50:54] RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:54:58] PROBLEM - SSH on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:57:26] PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:58:57] (PS2) KartikMistry: Content Translation: Move ttwiki out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/869347 (https://phabricator.wikimedia.org/T319177) [08:00:04] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:42] * kart_ is here [08:01:04] * kart_ will self-deploy..
[08:01:49] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/869347 (https://phabricator.wikimedia.org/T319177) (owner: KartikMistry) [08:02:31] (Merged) jenkins-bot: Content Translation: Move ttwiki out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/869347 (https://phabricator.wikimedia.org/T319177) (owner: KartikMistry) [08:02:57] (CR) Giuseppe Lavagetto: [C: +2] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto) [08:03:37] !log kartik@deploy1002 Started scap: Backport for [[gerrit:869347|Content Translation: Move ttwiki out of Beta (T319177)]] [08:03:41] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [08:04:39] (CR) CI reject: [V: -1] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto) [08:05:12] PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:05:35] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:869347|Content Translation: Move ttwiki out of Beta (T319177)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:08:31] (CR) Giuseppe Lavagetto: [V: +2 C: +2] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto) [08:08:37] (PS10) Giuseppe Lavagetto: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 [08:08:39] (CR) Giuseppe Lavagetto: [V: +2] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe
Lavagetto) [08:10:16] PROBLEM - Check systemd state on wdqs2010 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:11:10] 1 apache having issue it seems. [08:11:22] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:11:26] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:12:03] !log phedenskog@deploy1002 Started deploy [performance/navtiming@4f8c010]: (no justification provided) [08:12:12] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@4f8c010]: (no justification provided) (duration: 00m 08s) [08:12:12] It would be nice to check log: https://pastebin.com/7SzDcYr3 [08:12:40] !log installing Linux 4.19.269 on Buster hosts [08:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:00] PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:14:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:14:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:14:35] (PS1) Marostegui: mariadb: Change x1 to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/874776 (https://phabricator.wikimedia.org/T255174) [08:15:43] I'm getting: "ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out" during
backport/config deployment. Can anyone check this? [08:15:44] kart_: just had a look, parse1002 has a broken CPU, I'm opening a hardware task and will depool it [08:15:54] moritzm: ah OK. Thanks! [08:16:20] RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:16:22] PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:16:54] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet [08:17:34] kart_: can you retry? parse1002 is now removed from the list of hosts it attempts to deploy to [08:17:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:19:33] moritzm: it is finishing and at php-fpm restart stage. [08:19:41] Let me check. 
[08:19:47] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:869347|Content Translation: Move ttwiki out of Beta (T319177)]] (duration: 16m 09s) [08:19:47] ops-eqiad, serviceops: Broken CPU on parse1002 - https://phabricator.wikimedia.org/T326119 (MoritzMuehlenhoff) [08:19:48] PROBLEM - Check systemd state on wdqs2009 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:50] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [08:20:08] PROBLEM - Query Service HTTP Port on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:20:20] RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:20:29] Despite failure, I think deployment should be OK. [08:21:25] kart_: ack, I've filed https://phabricator.wikimedia.org/T326119 to get the server fixed and when it gets re-added after the fix, it'll be synced to the latest state of deployment [08:23:26] PROBLEM - Check systemd state on wdqs2011 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:12] Thanks moritzm!
[08:24:54] PROBLEM - Query Service HTTP Port on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:25:12] PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:26:10] PROBLEM - Check systemd state on wdqs2009 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:51] (PS2) KartikMistry: WIP: Enable Content Translation/Section Translation on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/870080 (https://phabricator.wikimedia.org/T325714) [08:32:44] (PS1) Volans: sre.pdus: fix docstring [cookbooks] - https://gerrit.wikimedia.org/r/874777 [08:33:34] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:34:36] (CR) Volans: [C: +2] sre.pdus: fix docstring [cookbooks] - https://gerrit.wikimedia.org/r/874777 (owner: Volans) [08:35:42] RECOVERY - Check systemd state on wdqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:13] (Merged) jenkins-bot: sre.pdus: fix docstring [cookbooks] - https://gerrit.wikimedia.org/r/874777 (owner: Volans) [08:36:56] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:37:42] RECOVERY - Check systemd state on wdqs2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:14] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource
properly for years - https://phabricator.wikimedia.org/T325607 (Samhaljml) p:Triage→Unbreak! [08:44:10] RECOVERY - SSH on wdqs2009 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:44:36] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Peachey88) p:Unbreak!→Triage [08:44:52] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Peachey88) [08:45:10] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:45:42] RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:00] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (Volans) [08:58:42] <_joe_> jouncebot: now [08:58:42] For the next 0 hour(s) and 1 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T0800) [09:00:41] (PS1) Volans: admin: add ldap-only entry for Wangombe [puppet] - https://gerrit.wikimedia.org/r/874778 (https://phabricator.wikimedia.org/T325828) [09:07:20] (PS1) Muehlenhoff: Update email address for jclark [puppet] - https://gerrit.wikimedia.org/r/874779 [09:16:56] PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:17:00] (CR)
Muehlenhoff: [C: +2] Update email address for jclark [puppet] - https://gerrit.wikimedia.org/r/874779 (owner: Muehlenhoff) [09:26:47] SRE, Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (Vgutierrez) [09:27:30] RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:27:51] !log restarting varnish on cp5032 to clear VarnishChildRestarted alert - T325797 [09:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:54] T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 [09:38:16] (CR) Clément Goubert: [V: +1 C: +2] P:logstash::production: mediawiki-http-accesslog [puppet] - https://gerrit.wikimedia.org/r/867136 (https://phabricator.wikimedia.org/T324439) (owner: Clément Goubert) [09:40:46] (PS1) Marostegui: realm.pp: Add two new tables to private [puppet] - https://gerrit.wikimedia.org/r/874781 (https://phabricator.wikimedia.org/T326105) [09:40:59] (CR) Marostegui: "This requires sanitarium restarts" [puppet] - https://gerrit.wikimedia.org/r/874781 (https://phabricator.wikimedia.org/T326105) (owner: Marostegui) [09:52:15] (PS1) Muehlenhoff: Update email address for uzoma [puppet] - https://gerrit.wikimedia.org/r/874783 [09:53:45] (CR) Muehlenhoff: [C: +2] Update email address for uzoma [puppet] - https://gerrit.wikimedia.org/r/874783 (owner: Muehlenhoff) [09:54:24] SRE, MW-on-K8s, observability, serviceops, Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Clement_Goubert) Kafka and logstash ingestion points configured.
[09:54:35] SRE, MW-on-K8s, observability, serviceops, Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Clement_Goubert) [09:55:29] SRE, MW-on-K8s, observability, serviceops, Patch-For-Review: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) Open→Resolved a:Clement_Goubert [09:57:05] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/874778 (https://phabricator.wikimedia.org/T325828) (owner: Volans) [09:57:44] (CR) Volans: [C: +2] admin: add ldap-only entry for Wangombe [puppet] - https://gerrit.wikimedia.org/r/874778 (https://phabricator.wikimedia.org/T325828) (owner: Volans) [09:57:54] (PS2) Volans: admin: add ldap-only entry for Wangombe [puppet] - https://gerrit.wikimedia.org/r/874778 (https://phabricator.wikimedia.org/T325828) [10:02:16] (PS4) Clément Goubert: mediawiki: Add GeoIP data to chart [deployment-charts] - https://gerrit.wikimedia.org/r/870660 [10:02:54] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (Volans) Open→Resolved @Wangombe you've been added to the LDAP `wmf` group. I'm resolving this task, feel free to re-open if there is any access-related issue.
[10:02:59] (CR) Clément Goubert: mediawiki: Add GeoIP data to chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/870660 (owner: Clément Goubert) [10:07:23] (CR) Giuseppe Lavagetto: [C: -1] P:kubernetes::mediawiki_runner: Copy GeoIP data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: Clément Goubert) [10:08:25] (PS49) David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [10:08:27] (PS9) David Caro: replica_cnf_web: add functional tests [puppet] - https://gerrit.wikimedia.org/r/867566 [10:13:31] (CR) Giuseppe Lavagetto: [C: -1] P:kubernetes::mediawiki_runner: Copy GeoIP data (2 comments) [puppet] - https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: Clément Goubert) [10:16:41] (PS1) Marostegui: control-mariadb-11.0-bullseye: Control for new MariaDB 11 [software] - https://gerrit.wikimedia.org/r/874784 (https://phabricator.wikimedia.org/T326116) [10:18:10] (PS6) Clément Goubert: P:kubernetes::mediawiki_runner: Copy GeoIP data [puppet] - https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) [10:18:38] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gerrit2002.wikimedia.org [10:19:44] (CR) Marostegui: [C: +2] control-mariadb-11.0-bullseye: Control for new MariaDB 11 [software] - https://gerrit.wikimedia.org/r/874784 (https://phabricator.wikimedia.org/T326116) (owner: Marostegui) [10:20:19] We will also be restarting gerrit1001 (production gerrit) soon (~15 minutes).
Ping me or hashar here if you have any problems during that time [10:20:31] (Merged) jenkins-bot: control-mariadb-11.0-bullseye: Control for new MariaDB 11 [software] - https://gerrit.wikimedia.org/r/874784 (https://phabricator.wikimedia.org/T326116) (owner: Marostegui) [10:21:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:21:15] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38932/console" [puppet] - https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: Clément Goubert) [10:24:07] (CR) Clément Goubert: [V: +1] P:kubernetes::mediawiki_runner: Copy GeoIP data (3 comments) [puppet] - https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: Clément Goubert) [10:25:21] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2002.wikimedia.org [10:25:36] (PS1) Btullis: Failover hive to the standby server [dns] - https://gerrit.wikimedia.org/r/874785 (https://phabricator.wikimedia.org/T303168) [10:31:05] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gerrit1001.wikimedia.org [10:33:20] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:26] PROBLEM - Host gerrit.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [10:33:34] ^ thats expected [10:33:48] should recover in around 5 minutes [10:35:40] RECOVERY - Host gerrit.wikimedia.org is UP: PING OK - Packet loss
= 0%, RTA = 0.23 ms [10:36:10] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1001.wikimedia.org [10:36:49] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on parse1002.eqiad.wmnet with reason: CPU1 machine check error [10:37:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on parse1002.eqiad.wmnet with reason: CPU1 machine check error [10:37:07] 10SRE, 10ops-eqiad, 10serviceops: Broken CPU on parse1002 - https://phabricator.wikimedia.org/T326119 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8a35d570-5625-4e3d-a6ff-eb737a303711) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: CPU1 machi... [10:37:38] (03PS1) 10Jgiannelos: maps: Disable osm sync temporarily due to wrong data [puppet] - 10https://gerrit.wikimedia.org/r/874806 (https://phabricator.wikimedia.org/T325293) [10:37:57] (03CR) 10CI reject: [V: 04-1] maps: Disable osm sync temporarily due to wrong data [puppet] - 10https://gerrit.wikimedia.org/r/874806 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos) [10:38:19] (03CR) 10Btullis: [C: 03+2] Failover hive to the standby server [dns] - 10https://gerrit.wikimedia.org/r/874785 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [10:38:56] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) p:05Triage→03High a:03Cmjohnson [10:39:40] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) p:05High→03Medium [10:40:36] (03CR) 10FNegri: cloud cumin: fix ssh config for codf1dev bastion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869816 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) 
[10:43:43] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host contint1002.wikimedia.org [10:45:58] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:23] (03CR) 10Ladsgroup: [C: 03+1] realm.pp: Add two new tables to private [puppet] - 10https://gerrit.wikimedia.org/r/874781 (https://phabricator.wikimedia.org/T326105) (owner: 10Marostegui) [10:47:20] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) a:05Cmjohnson→03Jclark-ctr [10:48:04] (03PS3) 10Filippo Giunchedi: decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) [10:48:59] (03CR) 10CI reject: [V: 04-1] decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [10:49:27] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint1002.wikimedia.org [10:49:42] (03CR) 10Marostegui: [C: 03+2] realm.pp: Add two new tables to private [puppet] - 10https://gerrit.wikimedia.org/r/874781 (https://phabricator.wikimedia.org/T326105) (owner: 10Marostegui) [10:50:43] !log Restart codfw sanitarium masters T326105 [10:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:46] T326105: Create cu_log_event and cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T326105 [10:52:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:kubernetes::mediawiki_runner: Copy GeoIP data [puppet] - 10https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [10:53:05] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host contint2002.wikimedia.org [10:53:53] !log Restart eqiad sanitarium T326105 
[10:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:56] (03CR) 10Ladsgroup: [C: 03+1] "we should make this more central to ease future changes. Do you have a ticket so I can play with puppet?" [puppet] - 10https://gerrit.wikimedia.org/r/874776 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [10:54:20] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:kubernetes::mediawiki_runner: Copy GeoIP data [puppet] - 10https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [10:55:09] (03CR) 10Marostegui: mariadb: Change x1 to STATEMENT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874776 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [10:55:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/775318 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:57:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Change x1 to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/874776 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [10:58:13] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (10BTullis) [10:58:50] (03PS1) 10Marostegui: core_test.pp: Add wmf-mariadb110 [puppet] - 10https://gerrit.wikimedia.org/r/874807 (https://phabricator.wikimedia.org/T326116) [10:58:57] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint2002.wikimedia.org [10:59:11] (03CR) 10Jbond: "FYI theses are captured in https://github.com/b4ldr/puppetlabs-concat/pull/1 perhaps we should continue to use something similar to make s" [puppet] - 10https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [10:59:15] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host contint2001.wikimedia.org [10:59:19] we 
are restarting the CI server [10:59:51] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-worker[1080,1084].eqiad.wmnet with reason: Shutting down to enable RAID battery replacement [10:59:54] (03PS4) 10Filippo Giunchedi: decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T1100) [11:00:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1080,1084].eqiad.wmnet with reason: Shutting down to enable RAID battery replacement [11:00:09] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2cd00c3a-edec-4e81-b729-7bd9792fa064) set by btullis@cumin1001 for 7 days, 0:00:00 on... [11:01:09] (03CR) 10David Caro: [C: 03+1] "Thanks a lot for this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/871298 (https://phabricator.wikimedia.org/T306642) (owner: 10Majavah) [11:03:22] (03CR) 10David Caro: [C: 03+2] openstack: wmf_sink: tell enc-api that we handle Git updates [puppet] - 10https://gerrit.wikimedia.org/r/871296 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [11:03:24] (03PS5) 10Clément Goubert: mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (https://phabricator.wikimedia.org/T288375) [11:03:51] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (10BTullis) [11:04:11] (03CR) 10Hashar: "recheck after CI server reboot" [puppet] - 10https://gerrit.wikimedia.org/r/874807 (https://phabricator.wikimedia.org/T326116) (owner: 10Marostegui) [11:04:25] (03CR) 10CI reject: [V: 04-1] mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [11:04:28] (03PS2) 10Jgiannelos: maps: Disable osm sync temporarily due to wrong data [puppet] - 10https://gerrit.wikimedia.org/r/874806 (https://phabricator.wikimedia.org/T325293) [11:04:31] (03CR) 10Hashar: "recheck after CI server reboot" [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [11:04:37] (03CR) 10Hashar: "recheck after CI server reboot" [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [11:04:48] !log Change x1 binlog format to STATEMENT T255174 [11:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:52] T255174: Extend echo_unread_wikis.euw_wiki - https://phabricator.wikimedia.org/T255174 [11:05:19] 10SRE, 10ops-eqiad, 10Data-Engineering: Check BBU on an-worker1084 - 
https://phabricator.wikimedia.org/T325984 (10BTullis) I've downtimed the host, shut it down, and created a hardware ticket for @Cmjohnson to replace the RAID controller battery: T326127 [11:06:13] 10SRE, 10ops-eqiad, 10Data-Engineering: Check BBU on an-worker1084 - https://phabricator.wikimedia.org/T325984 (10BTullis) a:03BTullis [11:06:51] !log contint2001: starting Jenkins manually [11:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:53] 10SRE, 10ops-eqiad, 10Data-Engineering: Check BBU on an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T325984 (10BTullis) [11:07:30] (03CR) 10CI reject: [V: 04-1] decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [11:09:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/870822 (owner: 10Muehlenhoff) [11:10:00] (03CR) 10David Caro: [C: 03+2] hieradata: move codfw1dev enc api data out of enc [puppet] - 10https://gerrit.wikimedia.org/r/871297 (owner: 10Majavah) [11:12:23] (03CR) 10Filippo Giunchedi: "Change LGTM, though I'd like to keep the ability to point the script to a configuration other than the global one (e.g. 
for testing purpos" [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:13:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) [11:13:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] maps: Disable osm sync temporarily due to wrong data [puppet] - 10https://gerrit.wikimedia.org/r/874806 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos) [11:14:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) 05Open→03Resolved [11:15:10] (03CR) 10David Caro: hieradata: move codfw1dev enc api data out of enc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/871297 (owner: 10Majavah) [11:16:17] (03PS1) 10Btullis: Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/874808 (https://phabricator.wikimedia.org/T303168) [11:17:00] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Change x1 to STATEMENT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874776 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [11:17:57] (03PS1) 10Volans: admin: ensure all contractors have an expiry date [puppet] - 10https://gerrit.wikimedia.org/r/874809 [11:18:42] (03CR) 10Stevemunene: [C: 03+1] Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/874808 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [11:19:23] (03CR) 10Btullis: [C: 03+2] Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/874808 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [11:20:02] (03CR) 10CI reject: [V: 04-1] admin: ensure all contractors have an expiry date [puppet] - 
10https://gerrit.wikimedia.org/r/874809 (owner: 10Volans) [11:20:15] (03PS1) 10Hashar: jenkins: disable systemd start timeout [puppet] - 10https://gerrit.wikimedia.org/r/874810 [11:20:18] (03PS2) 10Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) [11:21:02] (03CR) 10Volans: [C: 04-1] "No prob for the comment but PCC is failing with a duplicate definition" [puppet] - 10https://gerrit.wikimedia.org/r/870822 (owner: 10Muehlenhoff) [11:21:03] could I trouble someone for a +2/review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/868510 for T324782 :) [11:21:04] T324782: Replace deployment-prometheus02 - https://phabricator.wikimedia.org/T324782 [11:22:50] (03CR) 10Jelto: [C: 03+2] "lgtm until we found a safe value for the timeout" [puppet] - 10https://gerrit.wikimedia.org/r/874810 (owner: 10Hashar) [11:24:55] (03PS3) 10Filippo Giunchedi: P:grafana: move some profile declarations to roles [puppet] - 10https://gerrit.wikimedia.org/r/869208 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [11:25:23] !log Starting rolling reboot of parse* hosts in codfw [11:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:02] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:27:18] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [11:27:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38933/console" [puppet] - 10https://gerrit.wikimedia.org/r/869208 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [11:28:12] (03PS2) 10Volans: admin: ensure all contractors have an expiry date [puppet] - 10https://gerrit.wikimedia.org/r/874809 [11:29:22] (03CR) 10CI reject: [V: 04-1] admin: ensure all 
contractors have an expiry date [puppet] - 10https://gerrit.wikimedia.org/r/874809 (owner: 10Volans) [11:30:05] (03PS2) 10Muehlenhoff: netbox: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870822 [11:30:09] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] P:grafana: move some profile declarations to roles [puppet] - 10https://gerrit.wikimedia.org/r/869208 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [11:30:20] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint2001.wikimedia.org [11:30:20] (03CR) 10Clément Goubert: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [11:31:16] (03CR) 10Filippo Giunchedi: [C: 03+2] decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [11:31:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/870822 (owner: 10Muehlenhoff) [11:31:56] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38934/console" [puppet] - 10https://gerrit.wikimedia.org/r/869209 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [11:33:40] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [11:34:10] PROBLEM - Disk space on urldownloader1001 is CRITICAL: DISK CRITICAL - free space: / 125 MB (1% inode=81%): /tmp 125 MB (1% inode=81%): /var/tmp 125 MB (1% inode=81%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops [11:34:13] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] P:grafana: make the logo file customizable [puppet] - 10https://gerrit.wikimedia.org/r/869209 (https://phabricator.wikimedia.org/T307465) (owner: 
10Majavah) [11:34:20] (03PS3) 10Filippo Giunchedi: P:grafana: make the logo file customizable [puppet] - 10https://gerrit.wikimedia.org/r/869209 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [11:34:24] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:34:50] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [11:35:07] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:35:55] (03CR) 10Jbond: monitoring: update monitoring files to dynamically discover config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:39:40] (03CR) 10Jbond: [C: 03+1] netbox: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870822 (owner: 10Muehlenhoff) [11:40:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2131', diff saved to https://phabricator.wikimedia.org/P42744 and previous config saved to /var/cache/conftool/dbconfig/20230103-114030-marostegui.json [11:40:40] (03CR) 10Muehlenhoff: netbox: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870822 (owner: 10Muehlenhoff) [11:44:30] (03CR) 10Filippo Giunchedi: monitoring: update monitoring files to dynamically discover config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:55:33] ACKNOWLEDGEMENT - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T322529 - The acknowledgement expires at: 2023-01-08 11:55:16. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:55:33] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T322529 - The acknowledgement expires at: 2023-01-08 11:55:16. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:27] 10SRE, 10SRE Program Management, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10Ladsgroup) [11:57:10] (03Abandoned) 10Jgiannelos: Exclude OSM tag that causes a failing import [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos) [11:59:51] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/874826 (https://phabricator.wikimedia.org/T326133) [11:59:56] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/874827 (https://phabricator.wikimedia.org/T326133) [12:00:21] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/874828 (https://phabricator.wikimedia.org/T326134) [12:00:25] .18 [12:00:25] (03PS1) 10Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/874829 (https://phabricator.wikimedia.org/T326134) [12:02:46] (03PS3) 10Majavah: hieradata: move codfw1dev enc api data out of enc [puppet] - 10https://gerrit.wikimedia.org/r/871297 [12:02:48] (03PS5) 10Majavah: openstack: encapi: perform git updates server-side [puppet] - 10https://gerrit.wikimedia.org/r/871298 (https://phabricator.wikimedia.org/T306642) [12:02:50] (03PS1) 10Majavah: openstack: encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) [12:02:54] (03PS1) 10Majavah: openstack: encapi: 
new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) [12:02:56] (03PS1) 10Majavah: openstack: encapi: open up write access [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) [12:02:58] (03PS1) 10Majavah: openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 [12:03:18] (03CR) 10Majavah: hieradata: move codfw1dev enc api data out of enc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/871297 (owner: 10Majavah) [12:05:08] (03CR) 10CI reject: [V: 04-1] openstack: encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [12:06:24] (03CR) 10CI reject: [V: 04-1] openstack: encapi: new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [12:08:08] (03PS2) 10Majavah: openstack: encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) [12:08:10] (03PS2) 10Majavah: openstack: encapi: new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) [12:08:12] (03PS2) 10Majavah: openstack: encapi: open up write access [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) [12:08:14] (03PS2) 10Majavah: openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 [12:10:53] (03PS6) 10Majavah: openstack: encapi: perform git updates server-side [puppet] - 10https://gerrit.wikimedia.org/r/871298 (https://phabricator.wikimedia.org/T306642) [12:10:54] sorry for the spam.. 
[12:10:55] (03PS3) 10Majavah: openstack: encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) [12:10:57] (03PS3) 10Majavah: openstack: encapi: new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 (https://phabricator.wikimedia.org/T317478) [12:10:59] (03PS3) 10Majavah: openstack: encapi: open up write access [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) [12:11:01] (03PS3) 10Majavah: openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 [12:13:17] (03CR) 10Majavah: openstack: encapi: perform git updates server-side (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/871298 (https://phabricator.wikimedia.org/T306642) (owner: 10Majavah) [12:16:24] (03PS1) 10Marostegui: db2131: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/874816 (https://phabricator.wikimedia.org/T255174) [12:20:59] (03CR) 10David Caro: [C: 03+1] "LGTM waiting for all the dependencies before deploying" [puppet] - 10https://gerrit.wikimedia.org/r/871298 (https://phabricator.wikimedia.org/T306642) (owner: 10Majavah) [12:21:16] (03PS3) 10Volans: cookbook: improve help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 [12:21:18] (03PS1) 10Volans: kafka: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/874817 [12:21:20] (03PS1) 10Volans: config: expand user's home (~) for logs dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/874818 [12:21:27] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/871297 (owner: 10Majavah) [12:26:27] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) a:05hashar→03None I have cherry-picked 
https://gerrit.wikimedia.org/r/868002 to fix the... [12:27:26] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): pushing wmf-puppet-dashboard updates for enc git handling [12:28:38] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): pushing wmf-puppet-dashboard updates for enc git handling (duration: 01m 12s) [12:29:14] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10Volans) I've updated https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks, see the [[ https://wikitech.wikimedia.org/w/index.php?title=Spicerack%2FCookbo... [12:30:13] (03CR) 10Marostegui: [C: 03+2] db2131: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/874816 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [12:30:21] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:30:24] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: pushing wmf-puppet-dashboard updates for enc git handling [12:31:24] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for kpropd [puppet] - 10https://gerrit.wikimedia.org/r/775318 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:31:51] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:33:14] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: pushing wmf-puppet-dashboard updates for enc git handling (duration: 02m 49s) [12:33:20] (03CR) 10Jbond: "can you add a follow up patch with some code using this module, doesn't need to be complete but id like to see what this looks like in hie" [puppet] - 10https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: 10JHathaway) [12:39:21] (03CR) 
10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/870822 (owner: 10Muehlenhoff) [12:51:55] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [12:53:21] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset -0.000219 secs https://wikitech.wikimedia.org/wiki/NTP [13:01:06] (03CR) 10Jbond: [C: 03+1] cookbook: improve help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 (owner: 10Volans) [13:02:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/874818 (owner: 10Volans) [13:02:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/874817 (owner: 10Volans) [13:07:47] (03PS5) 10Jbond: kafka_config: set a real string for default api_version [puppet] - 10https://gerrit.wikimedia.org/r/868739 [13:09:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38935/console" [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [13:10:10] (03CR) 10Jbond: [V: 03+1] "done, the only thing showing on pcc is w white-space change https://puppet-compiler.wmflabs.org/output/868739/38935/webperf2003.codfw.wmne" [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [13:12:57] (03CR) 10Marostegui: [C: 03+2] core_test.pp: Add wmf-mariadb110 [puppet] - 10https://gerrit.wikimedia.org/r/874807 (https://phabricator.wikimedia.org/T326116) (owner: 10Marostegui) [13:18:10] (03PS7) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [13:18:35] Jelto and I are restarting Phabricator [13:19:10] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host phab1004.eqiad.wmnet [13:20:30] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: move dashboard config to a new define [puppet] - 10https://gerrit.wikimedia.org/r/871289 
(https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [13:22:13] (03CR) 10Filippo Giunchedi: "Good point, we should probably replace this with the actual home page? Adding Keith to get his opinion" [puppet] - 10https://gerrit.wikimedia.org/r/871290 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [13:24:02] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab1004.eqiad.wmnet [13:24:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM on a pragmatic basis to fix the alert and the followup on whether we really need the alert in the first place. Thank you Jaime" [puppet] - 10https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [13:25:01] (03PS8) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [13:29:13] (03CR) 10Muehlenhoff: [C: 03+2] netbox: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870822 (owner: 10Muehlenhoff) [13:31:38] (03PS2) 10Ayounsi: BGP for NTT in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/870904 (https://phabricator.wikimedia.org/T314929) [13:31:47] (03PS1) 10Marostegui: admin: Update my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/874825 [13:36:38] (03CR) 10Ladsgroup: [C: 03+1] "why stop at three? what's wrong with -vvvvvvvvvvvv?" 
[puppet] - 10https://gerrit.wikimedia.org/r/874825 (owner: 10Marostegui) [13:37:25] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:38:16] (03PS1) 10Hashar: jenkins: set systemd service start timeout to 5 min [puppet] - 10https://gerrit.wikimedia.org/r/874847 [13:38:42] (03CR) 10Marostegui: [C: 03+2] admin: Update my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/874825 (owner: 10Marostegui) [13:38:59] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:44:33] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 123 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:45:13] (03CR) 10Jelto: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/874847 (owner: 10Hashar) [13:46:07] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:46:44] !log installing libksba security updates [13:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:51:21] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38936/console" [puppet] - 10https://gerrit.wikimedia.org/r/870820 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:51:35] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:53:22] (03CR) 10Volans: [C: 03+2] kafka: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/874817 (owner: 10Volans) [13:53:30] (03CR) 10Volans: [C: 03+2] cookbook: improve help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 (owner: 10Volans) [13:53:37] (03CR) 10Volans: [C: 03+2] config: expand user's home (~) for logs dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/874818 (owner: 10Volans) [13:55:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:56:12] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Add the ClusterIP of kubernetes.default.cluster.local to cert [puppet] - 10https://gerrit.wikimedia.org/r/870820 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:56:58] (03Merged) 10jenkins-bot: kafka: fix typo in docstring [software/spicerack] - 
10https://gerrit.wikimedia.org/r/874817 (owner: 10Volans) [13:57:16] (03Merged) 10jenkins-bot: cookbook: improve help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 (owner: 10Volans) [13:57:19] (03Merged) 10jenkins-bot: config: expand user's home (~) for logs dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/874818 (owner: 10Volans) [13:58:04] (03PS1) 10Bartosz Dziewoński: a/b test anonymous ID was being reset because of cookie prefixes [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874868 (https://phabricator.wikimedia.org/T321961) [13:58:06] (03PS1) 10Bartosz Dziewoński: Log bucket/token for the DiscussionTools mobile a/b test [extensions/VisualEditor] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874869 (https://phabricator.wikimedia.org/T321961) [13:58:08] (03PS1) 10Bartosz Dziewoński: Log token for the DiscussionTools mobile a/b test [extensions/WikiEditor] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874866 (https://phabricator.wikimedia.org/T321961) [13:58:14] (03PS1) 10Bartosz Dziewoński: Log bucket/token for the DiscussionTools mobile a/b test [extensions/MobileFrontend] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874867 (https://phabricator.wikimedia.org/T321961) [13:58:59] (03PS1) 10Bartosz Dziewoński: Revert "Revert "Start mobile DiscussionTools A/B test"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874870 (https://phabricator.wikimedia.org/T321961) [13:59:07] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:09] (03PS2) 10Bartosz Dziewoński: Revert "Revert "Start mobile DiscussionTools A/B test"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874870 (https://phabricator.wikimedia.org/T321961) [13:59:15] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T1400). [14:00:05] _joe_ and tzatziki: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T1400) [14:00:07] ^ puppedb1003 is me, that's a new WIP host, not the main puppetdb instance [14:00:09] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:00:11] o/ I can deploy today [14:00:20] (03CR) 10JMeybohm: "LGTM but you will probably want to add YAML anchors for the IP(s) and port so that they can be reused by actual services (like in https://" [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [14:00:36] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10LSobanski) [14:00:39] hello! i am here [14:00:47] <_joe_> o/ [14:00:49] tzatziki: hi! which branch is your backport for? [14:00:50] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10LSobanski) [14:01:16] _joe_: do you want to self-service or want me to deploy for you? [14:01:28] truthfully I don't know - I haven't done this too many times before :D [14:01:41] is Reedy here? 
:) [14:01:47] <_joe_> taavi: I'm going to do the first one myself now, for the second patch I'd wait the back of the queue [14:01:53] hi, i have some patches i'd like to get deployed too, i just added them. sorry i'm a little late [14:01:53] <_joe_> it requires somewhat careful testing [14:01:55] PROBLEM - DPKG on puppetdb1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:02:05] _joe_: ack, please ping me when ready [14:02:17] (03CR) 10Ottomata: "LGTM other than the one comment! Please merge at will" [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [14:02:18] mediawiki/extensions/SecurePoll master, it seems [14:02:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] etcd: use the v3-style SRV record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841139 (https://phabricator.wikimedia.org/T320397) (owner: 10Giuseppe Lavagetto) [14:02:42] MatmaRex: ok! you have quite a few patches, so not sure if we'll have time for everything [14:02:46] <_joe_> ok, I'm gonna deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/841139, I'll let you know when it's done [14:02:59] taavi: yeah i know, that [14:03:03] <_joe_> taavi: as I said, I can wait for another slot for the other patch, it's not time sensitive [14:03:18] 's fine, we can leave them for the next window if needed [14:03:19] (03PS2) 10Giuseppe Lavagetto: etcd: use the v3-style SRV record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841139 (https://phabricator.wikimedia.org/T320397) [14:03:50] 10SRE, 10serviceops, 10Maps (Maps-data): Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229 (10LSobanski) [14:04:27] (03PS4) 10Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) [14:04:35] tzatziki: that's the repository. 
'branch' refers to which version we are backporting to. since the commit is included in wmf.17 (will be deployed to group0 today), I assume you'll need to backport it to wmf.14 (which is live everywhere else)? [14:05:13] (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841139 (https://phabricator.wikimedia.org/T320397) (owner: Giuseppe Lavagetto) [14:05:16] (PS1) Majavah: SecurePoll: Add files for UCoC 2023 vote [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/874871 (https://phabricator.wikimedia.org/T324793) [14:05:37] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:841139|etcd: use the v3-style SRV record (T320397)]] [14:05:39] T320397: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 [14:06:21] taavi: ah I see... I don't know. It may be that it just runs on the train based on comments in the Phab task so backport may not be necessary?
[14:06:31] (Sorry for my cluelessness :D ) [14:07:08] tzatziki: well, that depends on when you're intending to run the script :P [14:07:15] (CR) CI reject: [V: -1] monitoring: update monitoring files to dynamically discover config [puppet] - https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: Jbond) [14:07:30] !log oblivian@deploy1002 oblivian and oblivian: Backport for [[gerrit:841139|etcd: use the v3-style SRV record (T320397)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:07:52] taavi: fair :) We'd like to run it quite soon since the vote is in a couple of weeks only [14:08:55] tzatziki: if you need to run it this week, then you're going to need a backport for it (and 868464 too) [14:09:09] if you're fine waiting for next week, then you don't need to backport it [14:09:34] (unless the train isn't deployed this week, which is possible but rare) [14:09:37] ah I see. ideally this week if at all possible [14:09:57] ok, then you're going to need to backport it [14:10:29] give me a second, I'll prepare the patches for it [14:10:59] taavi: no worries - lemme know if there's anything I can do to support, though I'm very new to this [14:13:35] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:841139|etcd: use the v3-style SRV record (T320397)]] (duration: 07m 58s) [14:13:37] <_joe_> taavi: done, for some reason the php restarts were quite slow today [14:13:39] T320397: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 [14:13:44] ack, thanks [14:13:48] <_joe_> took 3 minutes [14:13:50] <_joe_> :/ [14:14:09] MatmaRex: does the ordering of your patches matter? [14:14:21] taavi: backports first in any order, then config [14:14:34] ok!
I'll start from those then [14:14:35] the backports don't do anything until the config change is deployed, so i'll have to test it all together [14:14:35] 10SRE, 10DNS, 10Infrastructure-Foundations: Reverse DNS missing for some hosts - https://phabricator.wikimedia.org/T251522 (10LSobanski) [14:15:42] (03PS1) 10Muehlenhoff: Mask uwsgi on puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/874852 (https://phabricator.wikimedia.org/T321783) [14:15:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/WikiEditor] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874866 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:15:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874867 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:16:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874868 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:16:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874869 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:16:37] (03PS1) 10Jbond: cloud_production: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/874853 [14:17:30] 10SRE, 10Infrastructure-Foundations: Simplify hiera lookup model - https://phabricator.wikimedia.org/T106404 (10LSobanski) [14:17:37] (03PS1) 10Majavah: ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874872 (https://phabricator.wikimedia.org/T324793) [14:18:14] 10SRE, 10Infrastructure-Foundations: 
Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742 (10LSobanski) [14:18:26] (03CR) 10David Caro: [C: 03+2] openstack: encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [14:18:51] 10SRE, 10Infrastructure-Foundations: Monitoring outgoing traffic for hosts with risky services - https://phabricator.wikimedia.org/T102104 (10LSobanski) [14:19:27] (03CR) 10CI reject: [V: 04-1] cloud_production: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/874853 (owner: 10Jbond) [14:19:57] 10SRE, 10Infrastructure-Foundations: serve our production ssh known_hosts file over public HTTPS - https://phabricator.wikimedia.org/T257219 (10LSobanski) [14:21:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:21:15] (03PS1) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874855 (https://phabricator.wikimedia.org/T314714) [14:21:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38937/console" [puppet] - 10https://gerrit.wikimedia.org/r/874853 (owner: 10Jbond) [14:21:40] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: wmf_auto_restart_{jenkins,rsync} failing on releases2002 - https://phabricator.wikimedia.org/T267795 (10LSobanski) [14:22:58] (03CR) 10JMeybohm: flink-app chart (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:25:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/874852 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [14:25:37] (03PS3) 10Majavah: ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874873 (https://phabricator.wikimedia.org/T324793) [14:26:55] if only the CI wasn't so slow.. [14:27:35] (03CR) 10CI reject: [V: 04-1] ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874873 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:27:43] 10SRE, 10Infrastructure-Foundations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10LSobanski) Doesn't seem like uwsgi-core made it into Buster backports. Does this need a follow up or are we OK to ignore it considering where we are in terms of m... [14:29:29] 10SRE, 10Maps: Monitor PostgreSQL connection slots - https://phabricator.wikimedia.org/T168767 (10akosiaris) 05Open→03Resolved a:03akosiaris I am gonna tentatively close this. No updates in >2.5 years and doesn't look like anyone is going to work on it. If we identify we actually need this, let's open a... 
[14:29:42] (03PS4) 10Majavah: ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874873 (https://phabricator.wikimedia.org/T324793) [14:30:37] (03CR) 10David Caro: [C: 03+2] openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 (owner: 10Majavah) [14:31:28] (03PS1) 10Muehlenhoff: Only add auto restart for rsync for the primary releases server [puppet] - 10https://gerrit.wikimedia.org/r/874858 (https://phabricator.wikimedia.org/T267795) [14:31:52] (03CR) 10CI reject: [V: 04-1] ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874873 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:32:25] (03Merged) 10jenkins-bot: Log token for the DiscussionTools mobile a/b test [extensions/WikiEditor] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874866 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:32:31] (03Merged) 10jenkins-bot: Log bucket/token for the DiscussionTools mobile a/b test [extensions/MobileFrontend] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874867 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:32:35] (03Merged) 10jenkins-bot: a/b test anonymous ID was being reset because of cookie prefixes [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874868 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:32:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [14:32:37] PROBLEM - url_downloader on urldownloader1002 is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused 
https://wikitech.wikimedia.org/wiki/Url-downloader [14:32:39] (03Merged) 10jenkins-bot: Log bucket/token for the DiscussionTools mobile a/b test [extensions/VisualEditor] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874869 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:32:41] woot, finally [14:32:47] PROBLEM - url_downloader on urldownloader2001 is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/Url-downloader [14:33:02] !log taavi@deploy1002 Started scap: Backport for [[gerrit:874866|Log token for the DiscussionTools mobile a/b test (T321961)]], [[gerrit:874867|Log bucket/token for the DiscussionTools mobile a/b test (T321961)]], [[gerrit:874868|a/b test anonymous ID was being reset because of cookie prefixes (T321961)]], [[gerrit:874869|Log bucket/token for the DiscussionTools mobile a/b test (T321961)]] [14:33:05] PROBLEM - url_downloader on urldownloader1001 is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/Url-downloader [14:33:06] T321961: [Config Change] Start mobile DiscussionTools A/B test - https://phabricator.wikimedia.org/T321961 [14:33:09] PROBLEM - url_downloader on urldownloader2002 is CRITICAL: connect to address url-downloader.wikimedia.org and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/Url-downloader [14:33:13] PROBLEM - Check systemd state on urldownloader1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service,squid.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:48] !log taavi@deploy1002 taavi and matmarex: Backport for 
[[gerrit:874866|Log token for the DiscussionTools mobile a/b test (T321961)]], [[gerrit:874867|Log bucket/token for the DiscussionTools mobile a/b test (T321961)]], [[gerrit:874868|a/b test anonymous ID was being reset because of cookie prefixes (T321961)]], [[gerrit:874869|Log bucket/token for the DiscussionTools mobile a/b test (T321961)]] synced to the testservers: [14:34:48] mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:35:15] MatmaRex: the backports are now available for testing. did you say that they needed the config change to be tested? [14:35:30] yeah [14:35:38] ok, then I'll sync and pull the config change next [14:36:34] (03PS5) 10Majavah: ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874873 (https://phabricator.wikimedia.org/T324793) [14:38:11] (03CR) 10Jelto: [C: 03+2] "All projects migrated to the new trusted tag. We can remove the old protected tag now." 
[puppet] - 10https://gerrit.wikimedia.org/r/870521 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto) [14:39:02] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts graphite1004.eqiad.wmnet [14:39:13] (03PS1) 10Jbond: cumin::target: use concat to manage the file [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) [14:39:53] (03CR) 10CI reject: [V: 04-1] cumin::target: use concat to manage the file [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [14:41:34] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:874866|Log token for the DiscussionTools mobile a/b test (T321961)]], [[gerrit:874867|Log bucket/token for the DiscussionTools mobile a/b test (T321961)]], [[gerrit:874868|a/b test anonymous ID was being reset because of cookie prefixes (T321961)]], [[gerrit:874869|Log bucket/token for the DiscussionTools mobile a/b test (T321961)]] (duration: 08m 31s) [14:41:37] T321961: [Config Change] Start mobile DiscussionTools A/B test - https://phabricator.wikimedia.org/T321961 [14:42:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874870 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:43:35] RECOVERY - url_downloader on urldownloader1002 is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080 https://wikitech.wikimedia.org/wiki/Url-downloader [14:43:35] (03Merged) 10jenkins-bot: Revert "Revert "Start mobile DiscussionTools A/B test"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874870 (https://phabricator.wikimedia.org/T321961) (owner: 10Bartosz Dziewoński) [14:43:37] (03PS2) 10Jbond: cumin::target: use concat to manage the file [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) [14:43:39] RECOVERY - Citoid LVS eqiad on 
citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:43:43] RECOVERY - url_downloader on urldownloader2001 is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080 https://wikitech.wikimedia.org/wiki/Url-downloader [14:43:58] !log taavi@deploy1002 Started scap: Backport for [[gerrit:874870|Revert "Revert "Start mobile DiscussionTools A/B test"" (T321961)]] [14:44:01] RECOVERY - url_downloader on urldownloader1001 is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080 https://wikitech.wikimedia.org/wiki/Url-downloader [14:44:07] RECOVERY - url_downloader on urldownloader2002 is OK: TCP OK - 0.001 second response time on url-downloader.wikimedia.org port 8080 https://wikitech.wikimedia.org/wiki/Url-downloader [14:44:11] RECOVERY - Check systemd state on urldownloader1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38939/console" [puppet] - 10https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [14:44:50] !log filippo@cumin1001 START - Cookbook sre.dns.netbox [14:45:01] 10SRE, 10Product-Infrastructure-Team-Backlog, 10Maps (Maps-data): publish kartotherian / tilerator metrics by cluster - https://phabricator.wikimedia.org/T150466 (10akosiaris) 05Open→03Invalid 6 years since this task was opened, 4 years since the last update, I am gonna tentatively resolve this as invali... 
[14:45:10] SRE, Maps: Monitor PostgreSQL connection slots - https://phabricator.wikimedia.org/T168767 (akosiaris) Resolved→Invalid [14:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:46] !log taavi@deploy1002 taavi and matmarex: Backport for [[gerrit:874870|Revert "Revert "Start mobile DiscussionTools A/B test"" (T321961)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:45:57] MatmaRex: the config change is now available for testing [14:46:03] (CR) CI reject: [V: -1] cumin::target: use concat to manage the file [puppet] - https://gerrit.wikimedia.org/r/874859 (https://phabricator.wikimedia.org/T323483) (owner: Jbond) [14:46:29] looking [14:47:07] taavi: seems good [14:47:15] ok, syncing [14:48:19] SRE, SRE-OnFire, Product-Infrastructure-Team-Backlog, Maps (Kartotherian), Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (akosiaris) This is more than 2 years old and the latest updates don't offer any new info.... [14:48:42] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: graphite1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001" [14:49:30] taavi: is there anything I need to do? [14:49:47] tzatziki: not right now. your patches are up next [14:49:57] OK, cool, thank you! [14:53:12] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:874870|Revert "Revert "Start mobile DiscussionTools A/B test"" (T321961)]] (duration: 09m 13s) [14:53:15] T321961: [Config Change] Start mobile DiscussionTools A/B test - https://phabricator.wikimedia.org/T321961 [14:53:18] MatmaRex: done!
[14:53:25] thank you [14:53:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874871 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:53:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874872 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:53:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874873 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:53:55] tzatziki: ok, doing the securepoll patches now. [14:54:05] sounds good! [14:54:45] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:55:49] (03Merged) 10jenkins-bot: SecurePoll: Add files for UCoC 2023 vote [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874871 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:55:51] (03Merged) 10jenkins-bot: ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874873 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:56:05] (03Merged) 10jenkins-bot: ucoc2023: Update populateEditCount to count Flow edits [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874872 (https://phabricator.wikimedia.org/T324793) (owner: 10Majavah) [14:56:19] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:32] !log taavi@deploy1002 Started scap: Backport for [[gerrit:874871|SecurePoll: Add files for UCoC 2023 vote 
(T324793)]], [[gerrit:874872|ucoc2023: Update populateEditCount to count Flow edits (T324793)]], [[gerrit:874873|ucoc2023: Update populateEditCount to count Flow edits (T324793)]] [14:56:35] T324793: Create SecurePoll voter list for 2023 Universal Code of Conduct vote on revised enforcement guidelines - https://phabricator.wikimedia.org/T324793 [14:58:24] !log taavi@deploy1002 taavi and taavi: Backport for [[gerrit:874871|SecurePoll: Add files for UCoC 2023 vote (T324793)]], [[gerrit:874872|ucoc2023: Update populateEditCount to count Flow edits (T324793)]], [[gerrit:874873|ucoc2023: Update populateEditCount to count Flow edits (T324793)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:58:40] these can't be tested on mwdebug, so deploying directly [14:59:15] (03CR) 10David Caro: [C: 03+2] keyholder: add fake instance-puppet keys [labs/private] - 10https://gerrit.wikimedia.org/r/871294 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [14:59:15] ok [14:59:24] (03CR) 10David Caro: [V: 03+2 C: 03+2] keyholder: add fake instance-puppet keys [labs/private] - 10https://gerrit.wikimedia.org/r/871294 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [14:59:57] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: graphite1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1001" [14:59:57] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:57] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] keyholder: add fake instance-puppet keys [labs/private] - 10https://gerrit.wikimedia.org/r/871294 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [14:59:57] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts graphite1004.eqiad.wmnet [15:01:36] 10ops-eqiad, 
10decommission-hardware, 10User-fgiunchedi: decommission graphite1004.eqiad.wmnet - https://phabricator.wikimedia.org/T324089 (10fgiunchedi) a:03Jclark-ctr [15:01:50] 10ops-eqiad, 10decommission-hardware, 10User-fgiunchedi: decommission graphite1004.eqiad.wmnet - https://phabricator.wikimedia.org/T324089 (10fgiunchedi) @Jclark-ctr host is all yours for decom [15:02:20] (03PS1) 10Matthias Mullie: Squashed diff to catch up to wmf/1.40.0-wmf.17 [extensions/SearchVue] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874887 [15:03:10] (03CR) 10Filippo Giunchedi: [C: 03+2] "Simple enough and non-production that I'll self-merge" [puppet] - 10https://gerrit.wikimedia.org/r/864727 (owner: 10Filippo Giunchedi) [15:03:18] jouncebot: nowandnext [15:03:18] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [15:03:18] In 1 hour(s) and 56 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T1700) [15:03:32] claime: the mw backport window is running a bit over time [15:03:40] taavi: ack [15:03:58] I'll wait to reboot parsoid in eqiad then :) [15:04:10] thanks! give me just a few more minutes [15:04:15] No rush [15:04:43] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:874871|SecurePoll: Add files for UCoC 2023 vote (T324793)]], [[gerrit:874872|ucoc2023: Update populateEditCount to count Flow edits (T324793)]], [[gerrit:874873|ucoc2023: Update populateEditCount to count Flow edits (T324793)]] (duration: 08m 10s) [15:04:46] T324793: Create SecurePoll voter list for 2023 Universal Code of Conduct vote on revised enforcement guidelines - https://phabricator.wikimedia.org/T324793 [15:04:50] (03PS2) 10Filippo Giunchedi: services: remove old graphite hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/859579 (https://phabricator.wikimedia.org/T315524) [15:05:03] tzatziki: done! 
the scripts are now available in production for both live versions [15:05:15] taavi: woot \o/ [15:05:17] _joe_: sorry, we didn't have time for your last patch [15:05:19] claime: all done! [15:05:21] thank you for putting up with me :) [15:05:34] <_joe_> taavi: as I said, it's ok [15:05:35] !log UTC afternoon backports done [15:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:57] (CR) Filippo Giunchedi: "Hosts are decom'd now, what do you think ?" [deployment-charts] - https://gerrit.wikimedia.org/r/859579 (https://phabricator.wikimedia.org/T315524) (owner: Filippo Giunchedi) [15:06:56] !log Starting rolling reboot of parse* hosts in eqiad [15:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:03] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:07:03] (PS3) Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) [15:07:48] SRE, Infrastructure-Foundations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (MoritzMuehlenhoff) Open→Declined Buster is now in LTS stage and we'll migrate away off it until September, so a backport doesn't make sense any more. [15:08:04] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/874858 (https://phabricator.wikimedia.org/T267795) (owner: Muehlenhoff) [15:09:20] SRE, DNS, Infrastructure-Foundations: Reverse DNS missing for some hosts - https://phabricator.wikimedia.org/T251522 (Volans) Is this report still valid @Reedy ? All direct and reverse records for the hosts are automatically generated from Netbox nowadays, so this should not happen anymore. ` $ dig...
[15:09:50] (CR) Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe) [15:10:36] (CR) Majavah: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe) [15:10:38] (CR) Filippo Giunchedi: "Change technically LGTM (see also inline)" [puppet] - https://gerrit.wikimedia.org/r/866594 (owner: Jbond) [15:10:44] (PS5) Clément Goubert: service::catalog: Add aux-k8s-ingress [puppet] - https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) [15:10:46] !log upgrading and rebooting wikitech-static [15:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:10] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [15:11:15] SRE, Infrastructure-Foundations: serve our production ssh known_hosts file over public HTTPS - https://phabricator.wikimedia.org/T257219 (Volans) Open→Resolved a:Volans This is already available on https://config-master.wikimedia.org/ , resolving.
[15:12:17] (03PS1) 10Btullis: Switch the cephosd servers to manual partition configuration [puppet] - 10https://gerrit.wikimedia.org/r/874888 (https://phabricator.wikimedia.org/T324670) [15:12:56] (03PS3) 10FNegri: Use a single file for public key [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [15:12:59] (03CR) 10Filippo Giunchedi: "LGTM, see inline for a naming nit" [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [15:13:06] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:13:11] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [15:13:25] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [15:13:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:13:38] (03CR) 10CI reject: [V: 04-1] Use a single file for public key [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [15:13:41] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [15:13:58] (03PS1) 10Matthias Mullie: Change IW breakpoint to be enabled on smaller screen [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874889 (https://phabricator.wikimedia.org/T321377) [15:14:01] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:14:08] (03CR) 10FNegri: Use a single file for public key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [15:14:45] 10SRE, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: wmf_auto_restart_{jenkins,rsync} failing on releases2002 - https://phabricator.wikimedia.org/T267795 (10MoritzMuehlenhoff) For jenkins this was already fixed in 2e88687c45b980bfb31b3fdc25e11377d1a49f48, for 
rsyncd I created https:/... [15:15:05] (03PS1) 10Matthias Mullie: Always show search results at full width [skins/MinervaNeue] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874890 (https://phabricator.wikimedia.org/T321377) [15:15:22] (03PS4) 10Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667) [15:16:07] (03CR) 10Btullis: [C: 03+2] Switch the cephosd servers to manual partition configuration [puppet] - 10https://gerrit.wikimedia.org/r/874888 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [15:16:32] (03PS5) 10Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) [15:16:34] (03PS1) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) [15:17:13] (03CR) 10Jbond: monitoring: update monitoring files to dynamically discover config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:18:24] (03PS6) 10Jbond: kafka_config: set a real string for default api_version [puppet] - 10https://gerrit.wikimedia.org/r/868739 [15:18:35] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [15:22:31] (03PS4) 10FNegri: Use a single file for public key [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [15:22:46] (03CR) 10Herron: "Thanks for this! 
Please see one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/871290 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [15:22:49] (03Restored) 10Matthias Mullie: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 (owner: 10Matthias Mullie) [15:22:53] (03PS2) 10Matthias Mullie: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 [15:22:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38940/console" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [15:25:38] (03Abandoned) 10Jbond: cloud_production: do not merge [puppet] - 10https://gerrit.wikimedia.org/r/874853 (owner: 10Jbond) [15:26:07] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458 (owner: 10Ssingh) [15:26:31] (03PS4) 10Ssingh: P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458 [15:26:35] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:27:19] (03PS7) 10Majavah: openstack: encapi: perform git updates server-side [puppet] - 10https://gerrit.wikimedia.org/r/871298 (https://phabricator.wikimedia.org/T306642) [15:27:21] (03PS4) 10Majavah: openstack: encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) [15:27:23] (03PS4) 10Majavah: openstack: encapi: new id-based api for Terraform [puppet] - 10https://gerrit.wikimedia.org/r/874812 
(https://phabricator.wikimedia.org/T317478) [15:27:25] (03PS4) 10Majavah: openstack: encapi: open up write access [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) [15:27:27] (03PS4) 10Majavah: openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 [15:27:29] (03PS1) 10Majavah: hieradata: use port 443 for enc access [puppet] - 10https://gerrit.wikimedia.org/r/874894 [15:27:52] (03PS3) 10Ssingh: site.pp: update LVS hosts in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/866441 (https://phabricator.wikimedia.org/T317247) [15:28:05] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:30:01] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1001.eqiad.wmnet with OS bullseye [15:30:11] RECOVERY - Disk space on urldownloader1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops [15:30:47] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868458 (owner: 10Ssingh) [15:33:30] (03CR) 10FNegri: [C: 04-1] "This fails with" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [15:36:44] (03CR) 10Ottomata: [C: 03+1] kafka_config: set a real string for default api_version [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [15:36:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10mpopov) [15:37:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage 
for host cephosd1001.eqiad.wmnet with OS bullseye [15:39:33] (03CR) 10Jbond: blackbox::check::http: change expiry check value from days to seconds (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/866594 (owner: 10Jbond) [15:41:05] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:41:56] (03CR) 10Volans: "early feedback as requested" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [15:43:14] (03CR) 10David Caro: karma: add metrcsinfra alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [15:43:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/874852 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [15:43:56] (03PS4) 10David Caro: karma: add metrcsinfra alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) [15:45:45] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:47:20] (03CR) 10Jbond: [C: 03+2] kafka_config: set a real string for default api_version [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [15:48:55] (03PS2) 10Jbond: blackbox::check::http: change expiry check value from days to seconds [puppet] - 10https://gerrit.wikimedia.org/r/866594 [15:49:50] (03CR) 10JHathaway: [V: 03+1 C: 03+2] concat: make compatible *again* with stretch hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [15:51:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:52:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10mpopov) [15:53:02] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/859579 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:53:19] (03PS1) 10Muehlenhoff: Remove access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/874896 [15:54:05] (03PS1) 10Jbond: spicerack: drop the empty check [puppet] - 10https://gerrit.wikimedia.org/r/874897 [15:54:55] (03PS5) 10FNegri: Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [15:55:17] (03CR) 10Andrew Bogott: [C: 03+2] openstack: encapi: 
perform git updates server-side [puppet] - 10https://gerrit.wikimedia.org/r/871298 (https://phabricator.wikimedia.org/T306642) (owner: 10Majavah) [15:55:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38942/console" [puppet] - 10https://gerrit.wikimedia.org/r/874897 (owner: 10Jbond) [15:55:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/874896 (owner: 10Muehlenhoff) [15:56:08] andrewbogott: shall I puppet-merge your patch along? [15:56:18] moritzm: yes please! [15:56:30] ack, done [15:57:49] (03CR) 10CI reject: [V: 04-1] Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [15:58:04] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) a:03fnegri [15:58:16] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) p:05Triage→03Medium [16:01:15] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [16:02:22] (03PS6) 10Jbond: Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:04:36] (03CR) 10CI reject: [V: 04-1] Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:07:16] (03CR) 10Andrew Bogott: [C: 03+2] openstack: encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [16:07:28] (03PS5) 10Andrew Bogott: openstack: 
encapi: format the forbidden errors as json [puppet] - 10https://gerrit.wikimedia.org/r/874811 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [16:07:54] (03PS2) 10Majavah: P:grafana: stop provisioning home dashboard [puppet] - 10https://gerrit.wikimedia.org/r/871290 (https://phabricator.wikimedia.org/T307465) [16:07:56] (03PS4) 10Majavah: P:metricsinfra: add profile and role for a Grafana server [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) [16:07:58] (03PS4) 10Majavah: P:wmcs::metricsinfra: add haproxy config for grafana [puppet] - 10https://gerrit.wikimedia.org/r/869211 (https://phabricator.wikimedia.org/T307465) [16:08:00] (03PS2) 10Majavah: P:wmcs::metricsinfra: add internal name for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/871291 (https://phabricator.wikimedia.org/T307465) [16:08:02] (03PS2) 10Majavah: P:wmcs::metricsinfra::grafana: configure data sources [puppet] - 10https://gerrit.wikimedia.org/r/871292 (https://phabricator.wikimedia.org/T307465) [16:08:30] (03CR) 10Majavah: P:grafana: stop provisioning home dashboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/871290 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [16:11:32] (03PS7) 10FNegri: Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [16:13:40] (03CR) 10Jbond: Make sure cloud_cumin public key is evaluated (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:13:53] (03CR) 10Muehlenhoff: [C: 03+1] "The patch looks good. 
I've reached out for clarification of the remaining three accounts and will trigger a CI check one all results are i" [puppet] - 10https://gerrit.wikimedia.org/r/874809 (owner: 10Volans) [16:14:16] (03CR) 10CI reject: [V: 04-1] Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:15:41] (03PS8) 10FNegri: Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [16:17:08] (03PS1) 10Ladsgroup: Disable LoadMonitor in CLI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) [16:17:47] (03CR) 10CI reject: [V: 04-1] Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:19:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/874897 (owner: 10Jbond) [16:19:43] (03PS9) 10FNegri: Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [16:20:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:23:10] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:31:16] (03CR) 10BCornwall: [C: 03+2] check_user.py: Fix GSuite misspelling [puppet] - 10https://gerrit.wikimedia.org/r/870994 (owner: 10BCornwall) [16:37:27] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38945/console" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:46:22] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS 
(): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38946/console" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:47:38] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10phaultfinder) [16:47:43] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38947/console" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [16:53:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] spicerack: drop the empty check [puppet] - 10https://gerrit.wikimedia.org/r/874897 (owner: 10Jbond) [16:53:44] (03PS10) 10FNegri: Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [16:53:52] brett: fyi i merged you chec_user spelling cr [16:53:52] (03CR) 10Herron: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/871290 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [16:56:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [16:58:25] (03CR) 10Filippo Giunchedi: blackbox::check::http: change expiry check value from days to seconds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866594 (owner: 10Jbond) [17:00:04] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:03:00] 10SRE, 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10RobH) >>! 
In T325824#8488220, @ayounsi wrote: > I emailed the RIPE to let them know this anchor is definitively offline. Thanks! I'll pull and toss in hardware recycling bin at the datacente... [17:04:46] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38948/console" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [17:07:27] 10SRE, 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10RobH) 05Open→03Stalled p:05Low→03Lowest a:05ayounsi→03RobH Throwing away a defunct item in our racks isn't really worth a specific trip into the datacenter, so settting this to low... [17:14:27] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:16:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:17] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:21:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:19] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:30:37] !log sudo cumin -b 1 -s 5 'A:codfw and 
P{O:swift::proxy}' 'depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool' [17:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:22] (03PS4) 10Raymond Ndibe: o tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) [17:34:33] (03CR) 10Raymond Ndibe: o tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [17:36:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [17:37:19] !log Finished parse reboots in eqiad [17:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:29] (03PS5) 10Raymond Ndibe: tools-webservice: read buildservice_repository from config [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) [17:44:33] (03CR) 10Vlad.shapik: "Thank you for this update." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/865779 (owner: 10Brion VIBBER) [17:54:24] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:54:29] (03PS21) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [17:55:03] (03PS12) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [17:56:47] (03CR) 10jenkins-bot: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:57:18] (03PS1) 10Andrew Bogott: New files/templates for OpenStack Zed [puppet] - 10https://gerrit.wikimedia.org/r/874906 (https://phabricator.wikimedia.org/T323086) [17:57:59] (03CR) 10CI reject: [V: 04-1] New files/templates for OpenStack Zed [puppet] - 10https://gerrit.wikimedia.org/r/874906 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T1800) [18:03:34] (03PS2) 10Giuseppe Lavagetto: Add the cache.mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/870903 [18:03:36] (03PS1) 10Giuseppe Lavagetto: mediawiki: use the mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/874908 [18:21:30] (03PS1) 10Andrew Bogott: Add package manifests for OpenStack version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874910 (https://phabricator.wikimedia.org/T323086) [18:21:32] (03PS1) 10Andrew Bogott: Add manifest for OpenStack designate version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874911 
(https://phabricator.wikimedia.org/T323086) [18:21:34] (03PS1) 10Andrew Bogott: Add manifest for openstack placement service version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874912 (https://phabricator.wikimedia.org/T323086) [18:21:36] (03PS1) 10Andrew Bogott: Add manifests for OpenStack nova version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874913 (https://phabricator.wikimedia.org/T323086) [18:21:38] (03PS1) 10Andrew Bogott: Add manifests for OpenStack cinder version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874914 (https://phabricator.wikimedia.org/T323086) [18:21:40] (03PS1) 10Andrew Bogott: Add manifest for OpenStack Glance version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874915 (https://phabricator.wikimedia.org/T323086) [18:21:42] (03PS1) 10Andrew Bogott: Add manifests for OpenStack Keystone version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874916 (https://phabricator.wikimedia.org/T323086) [18:21:44] (03PS1) 10Andrew Bogott: Add manifests for OpenStack Neutron version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874917 (https://phabricator.wikimedia.org/T323086) [18:21:46] (03PS1) 10Andrew Bogott: Add manifest for OpenStack Trove version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874918 (https://phabricator.wikimedia.org/T323086) [18:21:48] (03PS1) 10Andrew Bogott: Add manifest for Openstack Heat version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874919 (https://phabricator.wikimedia.org/T323086) [18:21:50] (03PS1) 10Andrew Bogott: Add manifest for OpenStack Magnum version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874920 (https://phabricator.wikimedia.org/T323086) [18:21:52] (03PS1) 10Andrew Bogott: Add manifest for OpenStack Barbican version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) [18:23:35] (03CR) 10Ottomata: flink-app chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 
(https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:24:03] (03PS13) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [18:25:24] (03PS2) 10Andrew Bogott: New files/templates for OpenStack Zed [puppet] - 10https://gerrit.wikimedia.org/r/874906 (https://phabricator.wikimedia.org/T323086) [18:25:26] (03PS2) 10Andrew Bogott: Add package manifests for OpenStack version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874910 (https://phabricator.wikimedia.org/T323086) [18:25:28] (03PS2) 10Andrew Bogott: Add manifest for OpenStack designate version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874911 (https://phabricator.wikimedia.org/T323086) [18:25:30] (03PS2) 10Andrew Bogott: Add manifest for openstack placement service version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874912 (https://phabricator.wikimedia.org/T323086) [18:25:32] (03PS2) 10Andrew Bogott: Add manifests for OpenStack nova version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874913 (https://phabricator.wikimedia.org/T323086) [18:25:34] (03PS2) 10Andrew Bogott: Add manifests for OpenStack cinder version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874914 (https://phabricator.wikimedia.org/T323086) [18:25:36] (03PS2) 10Andrew Bogott: Add manifest for OpenStack Glance version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874915 (https://phabricator.wikimedia.org/T323086) [18:25:38] (03PS2) 10Andrew Bogott: Add manifests for OpenStack Keystone version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874916 (https://phabricator.wikimedia.org/T323086) [18:25:40] (03PS2) 10Andrew Bogott: Add manifests for OpenStack Neutron version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874917 (https://phabricator.wikimedia.org/T323086) [18:25:42] (03PS2) 10Andrew Bogott: Add manifest for OpenStack Trove version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874918 
(https://phabricator.wikimedia.org/T323086) [18:25:44] (03PS2) 10Andrew Bogott: Add manifest for Openstack Heat version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874919 (https://phabricator.wikimedia.org/T323086) [18:25:46] (03PS2) 10Andrew Bogott: Add manifest for OpenStack Magnum version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874920 (https://phabricator.wikimedia.org/T323086) [18:25:48] (03PS2) 10Andrew Bogott: Add manifest for OpenStack Barbican version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) [18:25:50] (03CR) 10CI reject: [V: 04-1] Add manifests for OpenStack Keystone version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874916 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [18:25:52] (03CR) 10CI reject: [V: 04-1] Add manifests for OpenStack Neutron version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874917 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [18:25:54] (03CR) 10CI reject: [V: 04-1] Add manifest for OpenStack Trove version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874918 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [18:25:56] (03CR) 10CI reject: [V: 04-1] Add manifest for Openstack Heat version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874919 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [18:25:58] (03CR) 10CI reject: [V: 04-1] Add manifest for OpenStack Magnum version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874920 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [18:26:00] (03CR) 10CI reject: [V: 04-1] Add manifest for OpenStack Barbican version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [18:27:37] (03PS3) 10Andrew Bogott: New files/templates for OpenStack Zed [puppet] - 10https://gerrit.wikimedia.org/r/874906 
(https://phabricator.wikimedia.org/T323086) [18:27:39] (03PS3) 10Andrew Bogott: Add package manifests for OpenStack version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874910 (https://phabricator.wikimedia.org/T323086) [18:27:41] (03PS3) 10Andrew Bogott: Add manifest for OpenStack designate version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874911 (https://phabricator.wikimedia.org/T323086) [18:27:43] (03PS3) 10Andrew Bogott: Add manifest for openstack placement service version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874912 (https://phabricator.wikimedia.org/T323086) [18:27:45] (03PS3) 10Andrew Bogott: Add manifests for OpenStack nova version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874913 (https://phabricator.wikimedia.org/T323086) [18:27:47] (03PS3) 10Andrew Bogott: Add manifests for OpenStack cinder version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874914 (https://phabricator.wikimedia.org/T323086) [18:27:49] (03PS3) 10Andrew Bogott: Add manifest for OpenStack Glance version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874915 (https://phabricator.wikimedia.org/T323086) [18:27:51] (03PS3) 10Andrew Bogott: Add manifests for OpenStack Keystone version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874916 (https://phabricator.wikimedia.org/T323086) [18:27:53] (03PS3) 10Andrew Bogott: Add manifests for OpenStack Neutron version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874917 (https://phabricator.wikimedia.org/T323086) [18:27:55] (03PS3) 10Andrew Bogott: Add manifest for OpenStack Trove version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874918 (https://phabricator.wikimedia.org/T323086) [18:27:57] (03PS3) 10Andrew Bogott: Add manifest for Openstack Heat version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874919 (https://phabricator.wikimedia.org/T323086) [18:27:59] (03PS3) 10Andrew Bogott: Add manifest for OpenStack Magnum version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874920 
(https://phabricator.wikimedia.org/T323086) [18:28:01] (03PS3) 10Andrew Bogott: Add manifest for OpenStack Barbican version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) [18:36:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [18:58:22] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) >>! In T325557#8478613, @Vgutierrez wrote: > * Monitoring issue: CPU seconds for haproxy, varnish and ATS is reported as 0 on bullseye hosts: https://grafana.wikimedia.org/goto/eCGKNUc4k?orgId=1,... [19:00:04] dduvall and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T1900). [19:00:14] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:01:05] first MW train of the new year. ch00 ch00! 
[19:01:52] wee [19:08:23] Choo choo dduvall [19:09:14] should be a light rail kind of day hopefully [19:09:44] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874925 (https://phabricator.wikimedia.org/T325580) [19:09:46] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874925 (https://phabricator.wikimedia.org/T325580) (owner: 10TrainBranchBot) [19:11:11] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:11:27] Hello, I'm unsure if this channel is correct one, but I just want to let you know that CI is stuck. [19:11:39] I mean Zuul. https://integration.wikimedia.org/zuul/ [19:12:00] thank you, Kizule [19:12:45] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:13:44] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1001.eqiad.wmnet with OS bullseye [19:13:52] indeed. 
looks like a.ndrewbogott pushed a long chain of puppet patches, those have previously got Zuul stuck [19:18:34] !log dduvall@deploy1002 deploy-promote aborted: (duration: 08m 55s) [19:18:41] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874927 (https://phabricator.wikimedia.org/T325580) [19:18:42] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874927 (https://phabricator.wikimedia.org/T325580) (owner: 10TrainBranchBot) [19:19:43] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874927 (https://phabricator.wikimedia.org/T325580) (owner: 10TrainBranchBot) [19:20:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [19:20:25] well, the good news is that zuul/gearman are back to processing jobs. back news is that everything that was queued must be resubmitted [19:20:32] *bad* news [19:20:41] 😭 [19:21:16] perhaps this was related to the switchover to contint2001 [19:21:25] hashar: ^ any thoughts? [19:23:11] Great to see that it is working again. :) [19:23:29] Jdlrobson: Can you resubmit my patches again? :)  https://gerrit.wikimedia.org/r/q/project:mediawiki%252Fextensions%252FQuickResponse+status:open [19:25:02] Kizule: {{done}} [19:26:46] taavi: Amazing, thank you! 
[19:27:05] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:27:32] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.17 refs T325580 [19:27:35] T325580: 1.40.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T325580 [19:27:36] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission restbase-dev100{4,5,6} - https://phabricator.wikimedia.org/T325387 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:33:43] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4029 - https://phabricator.wikimedia.org/T321340 (10wiki_willy) a:03RobH [19:34:31] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops, 10Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10wiki_willy) a:03RobH [19:34:35] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:34:53] 10ops-drmrs: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10wiki_willy) a:03RobH [19:35:39] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4029 - https://phabricator.wikimedia.org/T321340 (10RobH) [19:35:49] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4029 - https://phabricator.wikimedia.org/T321340 (10RobH) 05Open→03Resolved This was taken care of during the recycling, when I completed the remaining checklist steps in unison with the other decom cp hosts there. resolving. 
[19:35:52] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH) [19:36:07] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:39:42] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops, 10Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10RobH) a:05RobH→03ayounsi I think this actually should be over to Arzhel, as the mastership change is something they're fixing with the upgr... [19:42:04] (03PS4) 10Andrew Bogott: New files/templates for OpenStack Zed [puppet] - 10https://gerrit.wikimedia.org/r/874906 (https://phabricator.wikimedia.org/T323086) [19:42:06] (03PS4) 10Andrew Bogott: Add package manifests for OpenStack version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874910 (https://phabricator.wikimedia.org/T323086) [19:42:08] (03PS4) 10Andrew Bogott: Add manifest for OpenStack designate version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874911 (https://phabricator.wikimedia.org/T323086) [19:42:10] (03PS4) 10Andrew Bogott: Add manifest for openstack placement service version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874912 (https://phabricator.wikimedia.org/T323086) [19:42:12] (03PS4) 10Andrew Bogott: Add manifests for OpenStack nova version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874913 (https://phabricator.wikimedia.org/T323086) [19:42:14] (03PS4) 10Andrew Bogott: Add manifests for OpenStack cinder version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874914 (https://phabricator.wikimedia.org/T323086) [19:42:16] (03PS4) 10Andrew Bogott: Add manifest for OpenStack Glance version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874915 
(https://phabricator.wikimedia.org/T323086) [19:42:18] (03PS4) 10Andrew Bogott: Add manifests for OpenStack Keystone version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874916 (https://phabricator.wikimedia.org/T323086) [19:42:20] (03PS4) 10Andrew Bogott: Add manifests for OpenStack Neutron version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874917 (https://phabricator.wikimedia.org/T323086) [19:42:22] (03PS4) 10Andrew Bogott: Add manifest for OpenStack Trove version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874918 (https://phabricator.wikimedia.org/T323086) [19:42:24] (03PS4) 10Andrew Bogott: Add manifest for Openstack Heat version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874919 (https://phabricator.wikimedia.org/T323086) [19:42:26] (03PS4) 10Andrew Bogott: Add manifest for OpenStack Magnum version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874920 (https://phabricator.wikimedia.org/T323086) [19:42:28] (03PS4) 10Andrew Bogott: Add manifest for OpenStack Barbican version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) [19:46:29] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-bearloga-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:51:09] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:rack/setup/install rdb101[34] - 
https://phabricator.wikimedia.org/T326170 (10RobH) [19:54:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10RobH) [19:54:39] (03CR) 10Andrew Bogott: [C: 03+2] New files/templates for OpenStack Zed [puppet] - 10https://gerrit.wikimedia.org/r/874906 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [19:55:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [19:56:17] (03CR) 10Andrew Bogott: [C: 03+2] Add package manifests for OpenStack version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874910 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [19:56:20] 10SRE, 10serviceops: rdb101[34] serviceops implementation tracking - https://phabricator.wikimedia.org/T326171 (10RobH) [19:56:34] 10SRE, 10serviceops: rdb101[34] serviceops implementation tracking - https://phabricator.wikimedia.org/T326171 (10RobH) [19:59:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10RobH) [20:00:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10RobH) 05Open→03Stalled Please note this is a Q4 order being placed in Q3 for discounting, but won't land in the datacenter until April 15th or later. I'm not entirely certain how... 
[20:02:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:07:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:08:32] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) After an hour of writing the above and digging around some more, I think I am even less convinced about my own theory about the cause of this issue. One of the main reasons being that I haven't fo... [20:12:46] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) >>! In T325557#8496155, @ssingh wrote: > After an hour of writing the above and digging around some more, I think I am even less convinced now about my own theory about the cause of this issue. On... 
[20:13:39] I think zuul is stuck again [20:13:51] cc dduvall and andrewbogott [20:15:36] (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for OpenStack Keystone version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874916 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:15:38] (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for OpenStack Neutron version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874917 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:15:40] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for OpenStack Trove version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874918 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:15:42] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for Openstack Heat version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874919 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:15:44] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for OpenStack Magnum version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874920 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:16:05] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for OpenStack Glance version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874915 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:16:55] (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for OpenStack cinder version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874914 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:17:02] (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for OpenStack nova version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874913 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:17:07] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for openstack placement service version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874912 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew 
Bogott) [20:17:11] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for OpenStack designate version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874911 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:17:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:18:21] (03PS5) 10Andrew Bogott: Add manifest for OpenStack Barbican version 'Zed' [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) [20:18:50] hm, it tested everything but the very last patch. I resubmitted that one. [20:18:56] But I don't know how to unstick it :/ [20:19:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10XenoRyet) Approved from me. 
[20:21:12] it'll probably just sort itself out if you give it enough time [20:22:12] from https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 looks like the gearman queue isn't as big as it was last time [20:22:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:24:25] (03CR) 10Zoranzoki21: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:37:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:42:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:43:43] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for OpenStack Barbican version 'Zed' [puppet] - 10https://gerrit.wikimedia.org/r/874921 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:47:26] (03PS1) 10Andrew Bogott: Openstack Designate codfw1dev -> version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/874937 (https://phabricator.wikimedia.org/T323086) [20:47:28] (03PS1) 10Andrew Bogott: Nova: puppetize /etc/nova/api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/874938 (https://phabricator.wikimedia.org/T323086) [20:49:05] (03CR) 10CI reject: [V: 04-1] Nova: puppetize /etc/nova/api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/874938 
(https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:58:47] weirdly only gate and submit jobs are getting ran, maybe they have highest priority [20:58:48] anyway [20:59:35] (03PS1) 10Dzahn: gerrit: require interface::alias before httpd class [puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) [20:59:45] yep I suspect it has something to do with the job priority [20:59:49] 10ops-drmrs: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) Ok, just to outline the escalation steps as I understand them for the MX204: * all of the ports are built in, so we cannot isolate just a line card for this * we'll need to approve them potentially having to pull th... [20:59:55] it'll clear eventually, just give it a while [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230103T2100). [21:00:05] zabe and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] hi [21:00:18] hey o/ [21:00:23] I can deploy if no-one else wants to [21:00:32] (03CR) 10CI reject: [V: 04-1] gerrit: require interface::alias before httpd class [puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) (owner: 10Dzahn) [21:01:13] (03PS2) 10Zabe: Stop setting $wgActorTableSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873880 (https://phabricator.wikimedia.org/T215466) [21:01:25] zabe: looks like the first three of your patches are no-ops and then the last one has actual effects, correct? 
[21:01:34] taavi, yes [21:02:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873880 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [21:02:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873887 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:02:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:02:51] (03PS2) 10Majavah: Pin cu_changes comment migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:02:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873880 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [21:02:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873887 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:03:00] (03CR) 10TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:03:13] (03PS2) 10Dzahn: gerrit: require interface::alias before httpd class [puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) [21:03:26] (03Merged) 10jenkins-bot: Stop setting $wgActorTableSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873880 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [21:04:32] (03PS4) 10Majavah: Pin $wgCommentTempTableSchemaMigrationStage to default value 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/873887 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:04:47] of course trying to do everything at once was a bad idea [21:04:50] (03CR) 10Majavah: [C: 03+2] Pin $wgCommentTempTableSchemaMigrationStage to default value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873887 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:05:00] (03PS3) 10Majavah: Pin cu_changes comment migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:05:06] (03CR) 10Majavah: [C: 03+2] Pin cu_changes comment migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:05:32] (03CR) 10CI reject: [V: 04-1] gerrit: require interface::alias before httpd class [puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) (owner: 10Dzahn) [21:05:44] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/874939/38955/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) (owner: 10Dzahn) [21:05:48] (03Merged) 10jenkins-bot: Pin $wgCommentTempTableSchemaMigrationStage to default value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873887 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:06:01] (03Merged) 10jenkins-bot: Pin cu_changes comment migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:06:04] I can +2 the patch to see if it passes tests :P [21:06:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873880 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [21:06:18] (03CR) 10TrainBranchBot: 
"Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/873887 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:06:20] (03CR) 10TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874418 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:06:40] !log taavi@deploy1002 Started scap: Backport for [[gerrit:873880|Stop setting $wgActorTableSchemaMigrationStage (T215466)]], [[gerrit:873887|Pin $wgCommentTempTableSchemaMigrationStage to default value (T299954)]], [[gerrit:874418|Pin cu_changes comment migration to old schema (T233004)]] [21:06:46] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [21:06:47] T215466: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 [21:06:47] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:08:30] !log taavi@deploy1002 taavi and zabe: Backport for [[gerrit:873880|Stop setting $wgActorTableSchemaMigrationStage (T215466)]], [[gerrit:873887|Pin $wgCommentTempTableSchemaMigrationStage to default value (T299954)]], [[gerrit:874418|Pin cu_changes comment migration to old schema (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:08:46] zabe: do you want to test something or should I just sync them out? 
[21:09:08] you can just sync them out [21:09:15] ack, will do [21:12:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:15:29] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:873880|Stop setting $wgActorTableSchemaMigrationStage (T215466)]], [[gerrit:873887|Pin $wgCommentTempTableSchemaMigrationStage to default value (T299954)]], [[gerrit:874418|Pin cu_changes comment migration to old schema (T233004)]] (duration: 08m 49s) [21:15:34] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [21:15:35] T215466: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 [21:15:35] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:15:54] (03PS2) 10Majavah: Start writing to cuc_comment_id on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874443 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:16:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874443 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:16:59] (03Merged) 10jenkins-bot: Start writing to cuc_comment_id on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874443 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:17:26] !log taavi@deploy1002 Started scap: Backport for [[gerrit:874443|Start writing to cuc_comment_id on test wikis (T233004)]] [21:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:19:20] !log taavi@deploy1002 taavi and zabe: Backport for [[gerrit:874443|Start writing to cuc_comment_id on test wikis (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:20:17] zabe: pulled to mwdebugs. do you need me to check some database rows manually? [21:21:59] yeah, made a test edit, could you do 'select * from cu_changes where cuc_user_text="Zabe" order by cuc_id desc limit 1\G' (hope thats correct)? [21:22:25] on testwiki I assume? sure, give me a second [21:22:30] yes [21:22:33] 10ops-drmrs: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) @ayounsi or @cmooney can you review the above and provide feedback and a window for the work scheduling? I can put in the remote hands request with drmrs staff. [21:22:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:23:18] 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) [21:23:46] yeah, the query was correct. I see a value in cuc_comment_id, the `comment` row with that id matches what's in cuc_comment [21:23:50] anything else? 
[21:23:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:24:00] nope [21:24:10] thanks, will sync [21:25:16] MatmaRex: you are up next. does the ordering of those matter? [21:25:37] hi [21:25:40] they're unrelated [21:25:47] so, no [21:26:16] ack. do you want to do those separately or at once? [21:26:44] either is fine. we can do them at once [21:27:02] sure, let's do that. saves a bit of time given how long the sync takes these days [21:27:17] (03PS2) 10Majavah: Use new DiscussionTools heading markup on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874855 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:27:44] hmm, that still claims to have a merge conflict [21:27:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:30:20] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:874443|Start writing to cuc_comment_id on test wikis (T233004)]] (duration: 12m 54s) [21:30:24] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:30:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: 10Bartosz Dziewoński) [21:30:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874855 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz 
Dziewoński) [21:30:45] MatmaRex: I think https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874855 needs a manual rebase [21:31:47] (03Merged) 10jenkins-bot: Specify Citoid RESTBase URL separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: 10Bartosz Dziewoński) [21:31:53] huh, i just wrote it today [21:32:00] oh, i see [21:32:06] (03PS3) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874855 (https://phabricator.wikimedia.org/T314714) [21:32:12] hmm [21:32:31] well, gerrit was complaining because i forgot to rebase the other change on master, and you rebased it on that one [21:32:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: 10Bartosz Dziewoński) [21:32:39] but now i clicked rebase and it rebased with no conflicts [21:32:40] (03CR) 10TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874855 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:32:49] hmm, weird [21:33:16] it did show up as 'merge conflict' before I clicked rebase, so I thought I'll rebase it on the other one just to avoid conflicts after the first one merges [21:33:38] (03Merged) 10jenkins-bot: Use new DiscussionTools heading markup on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874855 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:34:01] !log taavi@deploy1002 Started scap: Backport for [[gerrit:869226|Specify Citoid RESTBase URL separately (T325425)]], [[gerrit:874855|Use new DiscussionTools heading markup on group1 wikis (T314714)]] [21:34:05] T325425: VE auto citation is broken in hewiki - https://phabricator.wikimedia.org/T325425 
[21:34:06] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:35:51] !log taavi@deploy1002 taavi and matmarex: Backport for [[gerrit:869226|Specify Citoid RESTBase URL separately (T325425)]], [[gerrit:874855|Use new DiscussionTools heading markup on group1 wikis (T314714)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:36:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:37] MatmaRex: pulled both to mwdebugs, please test [21:37:11] looking [21:37:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:38:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.743 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:39:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:39:24] taavi: both looking good [21:40:16] thanks, syncing [21:41:43] (PS1) Dzahn: phabricator: add systemd::tmpfile snippet for phd run dir [puppet] - https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) [21:42:48] (CR) Andrew Bogott: [C: +2] Openstack Designate codfw1dev -> version 'zed' [puppet] - https://gerrit.wikimedia.org/r/874937 (https://phabricator.wikimedia.org/T323086) (owner: Andrew Bogott) [21:42:50] (PS2) Dzahn: phabricator: add systemd::tmpfile snippet for phd run dir [puppet] - https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) [21:44:59] (CR) Dzahn: [V: +1] 
"https://puppet-compiler.wmflabs.org/output/874943/38956/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) (owner: Dzahn) [21:46:13] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:869226|Specify Citoid RESTBase URL separately (T325425)]], [[gerrit:874855|Use new DiscussionTools heading markup on group1 wikis (T314714)]] (duration: 12m 12s) [21:46:17] T325425: VE auto citation is broken in hewiki - https://phabricator.wikimedia.org/T325425 [21:46:18] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:46:49] MatmaRex: those are now live. anything else? [21:47:09] i'm done. thank you [21:47:23] you're welcome, as always [21:47:33] !log UTC late backports done [21:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:48:42] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/868737/38958/gitlab-runner2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/868737 (https://phabricator.wikimedia.org/T325385) (owner: Ahmon Dancy) [21:49:01] (PS2) Dzahn: profile::gitlab::runner::allowed_services: Add kubestagemaster [puppet] - https://gerrit.wikimedia.org/r/868737 (https://phabricator.wikimedia.org/T325385) (owner: Ahmon Dancy) [21:50:39] (CR) Dzahn: [C: +2] profile::gitlab::runner::allowed_services: Add kubestagemaster [puppet] - https://gerrit.wikimedia.org/r/868737 (https://phabricator.wikimedia.org/T325385) (owner: Ahmon Dancy) [21:52:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:53:04] !log gitlab-runner* - allowing kubestagemaster.svc.eqiad.wmnet to connect to port 6443, run puppet via cumin, deploy gerrit:868737 - T325385 [21:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:08] T325385: Trusted gitlab runner containers need access to staging k8s cluster - https://phabricator.wikimedia.org/T325385 [21:55:27] !log gitlab-runner* - correction: allowing connections TO kubestagemaster.svc.eqiad.wmnet port 6443 FROM trusted runners, of course - T325385 [21:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:06] (CR) Dzahn: [C: +2] "deployed on gitlab-runner* via cumin" [puppet] - https://gerrit.wikimedia.org/r/868737 (https://phabricator.wikimedia.org/T325385) (owner: Ahmon Dancy) [21:58:18] (CR) Dzahn: [C: +2] "yea, this makes sense, thanks for the fix! did not see that old ticket but now it got tagged" [puppet] - https://gerrit.wikimedia.org/r/874858 (https://phabricator.wikimedia.org/T267795) (owner: Muehlenhoff) [22:01:12] SRE, Release-Engineering-Team, serviceops-collab, Patch-For-Review: wmf_auto_restart_{jenkins,rsync} failing on releases2002 - https://phabricator.wikimedia.org/T267795 (Dzahn) Thanks all for surfacing this older ticket and providing the fix. Deployed. on releases2002: ` Notice: /Stage[main]/... [22:04:43] SRE, Release-Engineering-Team, serviceops-collab, Patch-For-Review: wmf_auto_restart_{jenkins,rsync} failing on releases2002 - https://phabricator.wikimedia.org/T267795 (Dzahn) Open→Resolved a:Dzahn ` [releases2002:~] $ systemctl list-unit-files | grep wmf_auto | grep -E 'jenkins|rsyn... 
[22:07:11] SRE, Znuny, serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (Dzahn) @Arnoldokoth Could you maybe reach out to @Astinson to check the status on this? See the last comment from November. Cheers [22:11:56] (PS1) JHathaway: rspamd example hiera data, DO NOT MERGE [puppet] - https://gerrit.wikimedia.org/r/874945 [22:13:06] (CR) CI reject: [V: -1] rspamd example hiera data, DO NOT MERGE [puppet] - https://gerrit.wikimedia.org/r/874945 (owner: JHathaway) [22:19:15] (CR) JHathaway: rspamd: vendor github.com/oxc/puppet-rspamd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: JHathaway) [22:37:09] (CR) Dzahn: [C: +2] icinga: Make the punctuation error optional on check [puppet] - https://gerrit.wikimedia.org/r/868037 (https://phabricator.wikimedia.org/T317169) (owner: Jcrespo) [22:37:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:41:16] RECOVERY - Ensure legal html en.wb on en.wikibooks.org is OK: all html is present. https://phabricator.wikimedia.org/project/members/28/ [22:42:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:42:59] SRE, Observability-Alerting, observability, Patch-For-Review: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Dzahn) >>! 
In T317169#8466063, @jcrespo wrote: > While I agree, let's be practical and compromise, let's make the dot optional so the alert works as... [22:44:22] (PS2) Ladsgroup: Disable LoadMonitor in CLI [mediawiki-config] - https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) [22:44:56] SRE, Observability-Alerting, observability, Patch-For-Review: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Dzahn) Open→Resolved a:Dzahn before: {F35968127} after: {F35968129} [22:47:38] SRE, Observability-Alerting, observability, Patch-For-Review: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Dzahn) P.S. shortly after this got an "out of office" agent reply from someone in legal. So confirmed this is actively alerting legal. [22:47:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:48:56] (PS3) Krinkle: Disable LoadMonitor in CLI [mediawiki-config] - https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) (owner: Ladsgroup) [22:49:56] (CR) Krinkle: [C: +1] "LGTM. Note that as part of T314020, the new LoadMonitor logic is also mainly for web requests. 
We can have CLI simply spread with the fixe" [mediawiki-config] - https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) (owner: Ladsgroup) [22:52:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:02:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:04:18] (CR) Umherirrender: "There is a duplicate Iba11d40364475f58279a86cdefd1cc77c449b143 already merged" [mediawiki-config] - https://gerrit.wikimedia.org/r/874925 (https://phabricator.wikimedia.org/T325580) (owner: TrainBranchBot) [23:07:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:51:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:54:43] (PS1) Zabe: Start writing to cuc_comment_id on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/874957 (https://phabricator.wikimedia.org/T233004)