[00:00:20] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:10] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:47] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:47] (JobUnavailable) firing: (10) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:47] (JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:47] (JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:47]
(JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:52:47] (JobUnavailable) firing: (12) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:35:13] (PS1) Gerrit maintenance bot: mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/882253 (https://phabricator.wikimedia.org/T327609)
[03:51:22] (CR) Ladsgroup: [C: +2] mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/882253 (https://phabricator.wikimedia.org/T327609) (owner: Gerrit maintenance bot)
[03:52:30] !log Starting s2 codfw failover from db2107 to db2104 - T327609
[03:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:34] T327609: Switchover s2 master (db2107 -> db2104) - https://phabricator.wikimedia.org/T327609
[03:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2107 T327609', diff saved to https://phabricator.wikimedia.org/P43207 and previous config saved to /var/cache/conftool/dbconfig/20230123-035458-ladsgroup.json
[03:56:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[03:56:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with
reason: Maintenance
[04:02:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance
[04:02:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance
[04:12:28] (CR) Ladsgroup: [C: +1] "Can you add the ticket?" [mediawiki-config] - https://gerrit.wikimedia.org/r/868127 (owner: Daniel Kinzler)
[04:28:01] (PS1) Gerrit maintenance bot: mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/882254 (https://phabricator.wikimedia.org/T327611)
[04:30:53] (Abandoned) Ladsgroup: mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/881375 (https://phabricator.wikimedia.org/T327370) (owner: Gerrit maintenance bot)
[04:32:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T327611
[04:32:57] T327611: Switchover s5 master (db2113 -> db2123) - https://phabricator.wikimedia.org/T327611
[04:33:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T327611
[04:33:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2123 with weight 0 T327611', diff saved to https://phabricator.wikimedia.org/P43208 and previous config saved to /var/cache/conftool/dbconfig/20230123-043324-ladsgroup.json
[04:51:45] (CR) Ladsgroup: [C: +2] mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/882254 (https://phabricator.wikimedia.org/T327611) (owner: Gerrit maintenance bot)
[04:53:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[04:53:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[04:57:08] !log
Starting s5 codfw failover from db2113 to db2123 - T327611
[04:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:12] T327611: Switchover s5 master (db2113 -> db2123) - https://phabricator.wikimedia.org/T327611
[04:57:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2123 to s5 primary T327611', diff saved to https://phabricator.wikimedia.org/P43209 and previous config saved to /var/cache/conftool/dbconfig/20230123-045740-ladsgroup.json
[04:59:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2113 T327611', diff saved to https://phabricator.wikimedia.org/P43210 and previous config saved to /var/cache/conftool/dbconfig/20230123-045939-ladsgroup.json
[05:01:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[05:02:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[05:07:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance
[05:07:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance
[05:13:37] (PS1) KartikMistry: Content Translation: Add campaign for Wiki Loves Living Heritage [mediawiki-config] - https://gerrit.wikimedia.org/r/882266 (https://phabricator.wikimedia.org/T327587)
[05:33:43] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[05:34:37] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[05:50:49] (PS3) KartikMistry: Update cxserver to 2023-01-20-051603-production [deployment-charts] -
https://gerrit.wikimedia.org/r/881051 (https://phabricator.wikimedia.org/T323840)
[05:56:33] Updating cxserver in a few minutes..
[05:57:12] (CR) KartikMistry: [C: +2] Update cxserver to 2023-01-20-051603-production [deployment-charts] - https://gerrit.wikimedia.org/r/881051 (https://phabricator.wikimedia.org/T323840) (owner: KartikMistry)
[06:02:07] (Merged) jenkins-bot: Update cxserver to 2023-01-20-051603-production [deployment-charts] - https://gerrit.wikimedia.org/r/881051 (https://phabricator.wikimedia.org/T323840) (owner: KartikMistry)
[06:12:06] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:12:33] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:16:21] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:17:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:17:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:17:05] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:18:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:18:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:18:37] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:19:31] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:23:29] !log Updated cxserver to 2023-01-20-051603-production (T323840, T326236)
[06:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:35] T326236: Post-creation work for gucwiki - https://phabricator.wikimedia.org/T326236
[06:23:35] T323840: Make the
Google translate the default Machine Translation in Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T323840
[06:52:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:56:36] (PS1) Stang: bnwikiquote: Update logo [mediawiki-config] - https://gerrit.wikimedia.org/r/882422 (https://phabricator.wikimedia.org/T323131)
[06:58:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[06:58:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[07:02:54] (PS1) Stang: shnwikibooks: Add project logo [mediawiki-config] - https://gerrit.wikimedia.org/r/882425 (https://phabricator.wikimedia.org/T327380)
[07:05:26] (PS2) Stang: bnwikiquote: Update logo [mediawiki-config] - https://gerrit.wikimedia.org/r/882422 (https://phabricator.wikimedia.org/T323131)
[07:05:46] (PS3) Stang: bnwikiquote: Update logo [mediawiki-config] - https://gerrit.wikimedia.org/r/882422 (https://phabricator.wikimedia.org/T323131)
[07:08:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:08:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:09:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:09:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert)
firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 db1206 T326669', diff saved to https://phabricator.wikimedia.org/P43211 and previous config saved to /var/cache/conftool/dbconfig/20230123-071323-marostegui.json
[07:13:27] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:22:08] (CR) Ayounsi: Add PTR resolution to firewall logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: Ayounsi)
[07:23:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P43212 and previous config saved to /var/cache/conftool/dbconfig/20230123-072309-ladsgroup.json
[07:24:00] (PS1) Marostegui: mariadb: Switch s1 sanitarium master [puppet] - https://gerrit.wikimedia.org/r/882515 (https://phabricator.wikimedia.org/T326669)
[07:24:44] (CR) Marostegui: [C: +2] mariadb: Switch s1 sanitarium master [puppet] - https://gerrit.wikimedia.org/r/882515 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 5%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43213 and previous config saved to /var/cache/conftool/dbconfig/20230123-072520-root.json
[07:25:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43214 and previous config saved to /var/cache/conftool/dbconfig/20230123-072530-root.json
[07:37:24] (CR) Ayounsi: WIP: add rt_flow grokking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/880500
(https://phabricator.wikimedia.org/T325806) (owner: Filippo Giunchedi)
[07:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P43215 and previous config saved to /var/cache/conftool/dbconfig/20230123-073814-ladsgroup.json
[07:40:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 10%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43216 and previous config saved to /var/cache/conftool/dbconfig/20230123-074025-root.json
[07:40:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43217 and previous config saved to /var/cache/conftool/dbconfig/20230123-074035-root.json
[07:41:52] (PS3) Stang: zhwiki: Install PageAssessments [mediawiki-config] - https://gerrit.wikimedia.org/r/876196 (https://phabricator.wikimedia.org/T326387)
[07:42:33] PROBLEM - puppet last run on idm-test1001 is CRITICAL: CRITICAL: Puppet has been disabled for 604942 seconds, message: test OIDC - slyngshede, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:43:52] (PS6) Elukey: changeprop: add liftwing revscoring streams [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302)
[07:43:54] (PS7) Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302)
[07:44:20] (CR) Elukey: "I added one last little change, namely the possibility to set the kafka topic :)" [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[07:44:47] (CR) Elukey: "Added the kafka topic parameter to the staging settings (now the chart allows to
specify it)." [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[07:53:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P43218 and previous config saved to /var/cache/conftool/dbconfig/20230123-075319-ladsgroup.json
[07:55:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43219 and previous config saved to /var/cache/conftool/dbconfig/20230123-075530-root.json
[07:55:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43220 and previous config saved to /var/cache/conftool/dbconfig/20230123-075540-root.json
[08:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T0800). Please do the needful.
[08:00:05] MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:41] hi
[08:00:48] is anyone really working at this hour? :D
[08:02:53] MatmaRex: let me check
[08:03:58] is the sync order okay?
[08:04:09] <_joe_> sirenbot: wake up
[08:05:34] Amir1: order shouldn't matter for this backport
[08:05:39] <_joe_> sigh, didn't we give -O to it?
[08:06:32] it could change the hashes of the modules and such but meh
[08:06:43] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/882174 (https://phabricator.wikimedia.org/T327328) (owner: Bartosz Dziewoński)
[08:08:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2107 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P43221 and previous config saved to /var/cache/conftool/dbconfig/20230123-080824-ladsgroup.json
[08:10:20] (PS1) Func: SpecialUserrights: Allow updating the expiry of user groups [core] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/882179 (https://phabricator.wikimedia.org/T327605)
[08:10:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43222 and previous config saved to /var/cache/conftool/dbconfig/20230123-081035-root.json
[08:10:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43223 and previous config saved to /var/cache/conftool/dbconfig/20230123-081045-root.json
[08:12:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:29] (Merged) jenkins-bot: Tweaks for new heading HTML structure [extensions/DiscussionTools] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/882174 (https://phabricator.wikimedia.org/T327328) (owner: Bartosz Dziewoński)
[08:12:47] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:882174|Tweaks for new heading HTML structure (T327328
T327469)]]
[08:12:52] T327469: Subscribe buttons/links are displayed out of place due to new heading HTML structure - https://phabricator.wikimedia.org/T327469
[08:12:52] T327328: Highlight skips the topic container for new topics, which looks odd - https://phabricator.wikimedia.org/T327328
[08:13:57] (CR) Muehlenhoff: "You also need to remove profile::idp::client:httpd from profile::racktables, then it will work." [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[08:14:15] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:14:23] _joe_: sirenbot's +O was removed by ircservserv-wm_ with the last sync as that wasn't granted via its configuration
[08:14:41] <_joe_> taavi: yeah I just saw, I thought it was
[08:14:51] <_joe_> I remember someone writing a patch, I assumed it was merged
[08:15:03] <_joe_> I'll fix it once I'm done writing docs
[08:15:49] yeah, it has +o not +O
[08:16:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:17:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.943 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:17:19] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:17:23] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881902 (https://phabricator.wikimedia.org/T228730) (owner: BCornwall)
[08:19:03] (CR) Majavah: [V: +1] ldap: move ssh-key-ldap-lookup directly to ssh module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/877964 (owner: Majavah)
[08:22:33] !log ladsgroup@deploy1002 ladsgroup and matmarex: Backport for [[gerrit:882174|Tweaks for new heading HTML structure (T327328 T327469)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:22:38] T327469: Subscribe buttons/links are displayed out of place due to new heading HTML structure - https://phabricator.wikimedia.org/T327469
[08:22:38] T327328: Highlight skips the topic container for new topics, which looks odd - https://phabricator.wikimedia.org/T327328
[08:22:42] MatmaRex: it's in mwdebug now
[08:23:21] Amir1: works as expected
[08:23:48] deploying
[08:25:03] (PS1) Muehlenhoff: Remove openldap_corp role from ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/882573 (https://phabricator.wikimedia.org/T323820)
[08:25:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43224 and previous config saved to /var/cache/conftool/dbconfig/20230123-082540-root.json
[08:25:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43225 and previous config saved to /var/cache/conftool/dbconfig/20230123-082550-root.json
[08:30:00] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:882174|Tweaks for new heading HTML structure (T327328 T327469)]] (duration: 17m 12s)
[08:30:05] T327469: Subscribe buttons/links are displayed out of place due to new heading
HTML structure - https://phabricator.wikimedia.org/T327469
[08:30:05] T327328: Highlight skips the topic container for new topics, which looks odd - https://phabricator.wikimedia.org/T327328
[08:30:08] MatmaRex: done
[08:30:36] thanks Amir1!
[08:33:51] (CR) Muehlenhoff: [C: +2] Remove openldap_corp role from ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/882573 (https://phabricator.wikimedia.org/T323820) (owner: Muehlenhoff)
[08:34:33] (CR) Zabe: [C: +2] Remove oversight group from privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/882217 (https://phabricator.wikimedia.org/T112147) (owner: Zabe)
[08:35:25] (Merged) jenkins-bot: Remove oversight group from privileged groups [mediawiki-config] - https://gerrit.wikimedia.org/r/882217 (https://phabricator.wikimedia.org/T112147) (owner: Zabe)
[08:36:19] !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[08:36:37] (PS1) Zabe: Start reading from cuc_comment_id on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/882577 (https://phabricator.wikimedia.org/T233004)
[08:36:53] (CR) Zabe: [C: +2] Start reading from cuc_comment_id on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/882577 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[08:37:28] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 01m 08s)
[08:37:37] (Merged) jenkins-bot: Start reading from cuc_comment_id on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/882577 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[08:37:56] !log zabe@deploy1002 Started scap: Backport for [[gerrit:882217|Remove oversight group from privileged groups (T112147)]], [[gerrit:882577|Start reading from cuc_comment_id on wikidatawiki (T233004)]]
[08:38:01] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[08:38:01]
T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147
[08:39:37] !log zabe@deploy1002 zabe: Backport for [[gerrit:882217|Remove oversight group from privileged groups (T112147)]], [[gerrit:882577|Start reading from cuc_comment_id on wikidatawiki (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[08:40:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43226 and previous config saved to /var/cache/conftool/dbconfig/20230123-084045-root.json
[08:40:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: After changing s1 sanitarium master', diff saved to https://phabricator.wikimedia.org/P43227 and previous config saved to /var/cache/conftool/dbconfig/20230123-084055-root.json
[08:42:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 to vslow and dump group T326669', diff saved to https://phabricator.wikimedia.org/P43228 and previous config saved to /var/cache/conftool/dbconfig/20230123-084239-marostegui.json
[08:42:43] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[08:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 to vslow and dump group T326669', diff saved to https://phabricator.wikimedia.org/P43229 and previous config saved to /var/cache/conftool/dbconfig/20230123-084326-marostegui.json
[08:45:44] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:882217|Remove oversight group from privileged groups (T112147)]], [[gerrit:882577|Start reading from cuc_comment_id on wikidatawiki (T233004)]] (duration: 07m 48s)
[08:45:49] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[08:45:49] T112147: Rename the oversight group on WMF projects
to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147
[08:46:22] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[08:47:35] (PS1) Marostegui: db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/882578 (https://phabricator.wikimedia.org/T327616)
[08:48:28] (CR) Marostegui: [C: +2] db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/882578 (https://phabricator.wikimedia.org/T327616) (owner: Marostegui)
[08:48:37] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-create ripe-atlas-esams records as the host is back up - volans@cumin1001"
[08:49:37] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-create ripe-atlas-esams records as the host is back up - volans@cumin1001"
[08:49:37] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:01] <_joe_> taavi: do you know how I give +O to a user via ircservserv?
The meta page has nothing, so checking before I read the sources
[08:52:41] _joe_: not sure, but I wouldn't be surprised if there is not an option for that atm
[08:53:53] <_joe_> yeah https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/irc/ircservserv/+/refs/heads/master/src/channel.rs
[08:54:37] <_joe_> so yeah I guess I'll just make sirenbot ask chanserv for permissions where needed instead
[09:07:52] _joe_: one option would be to grant it +t via the op rule, and then use `PRIVMSG ChanServ :TOPIC foo` instead of `TOPIC :foo` directly
[09:15:36] (CR) Filippo Giunchedi: [C: +1] Clarify ecs.version field format in docs [software/ecs] - https://gerrit.wikimedia.org/r/881809 (https://phabricator.wikimedia.org/T292585) (owner: Cwhite)
[09:16:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2113.codfw.wmnet with reason: Maintenance
[09:16:14] (CR) Filippo Giunchedi: [C: +1] logstash: enable filters for ecs 1.11.0 [puppet] - https://gerrit.wikimedia.org/r/881812 (https://phabricator.wikimedia.org/T326794) (owner: Cwhite)
[09:16:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2113.codfw.wmnet with reason: Maintenance
[09:16:35] (CR) Filippo Giunchedi: [C: +1] conftool-data: add logstash[12]032 to kibana7 backend [puppet] - https://gerrit.wikimedia.org/r/881813 (owner: Cwhite)
[09:17:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[09:17:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[09:19:59] (CR) Filippo Giunchedi: "The patch itself looks good, not +1'ing yet (I've left a comment in the task)" [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[09:21:51] (PS5) Clément Goubert: mediawiki:
Update ecs logging to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/881877
[09:21:59] (CR) Clément Goubert: mediawiki: Update ecs logging to 1.11.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881877 (owner: Clément Goubert)
[09:26:15] (CR) Giuseppe Lavagetto: [C: -1] "I think there's a couple small mistakes but LGTM otherwise." [deployment-charts] - https://gerrit.wikimedia.org/r/881877 (owner: Clément Goubert)
[09:29:13] (CR) Clément Goubert: mediawiki: Update ecs logging to 1.11.0 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/881877 (owner: Clément Goubert)
[09:32:00] (CR) Hashar: [C: +2] wm-checks-api: fix TypeScript noImplicitAny [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/876212 (owner: Hashar)
[09:32:17] SRE, SRE-Access-Requests: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (BTullis) a:BTullis I will pick up this ticket, since I work with Jennifer on the Data Engineering team.
[09:32:34] (PS6) Clément Goubert: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/881877
[09:32:58] (Merged) jenkins-bot: wm-checks-api: fix TypeScript noImplicitAny [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/876212 (owner: Hashar)
[09:33:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[09:35:28] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (BTullis)
[09:40:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[09:41:13] btullis: <3
[09:41:55] claime: Thanks :-)
[09:45:57] SRE, LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (BTullis)
[09:46:33] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (BTullis)
[09:47:19] <_joe_> jouncebot: nowandnext
[09:47:19] No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[09:47:19] In 1 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T1100)
[09:47:27] SRE, LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (BTullis) Apologies for the confusion. This is a duplicate of {T327406} where we have collected the necessary approval.
[09:54:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2113.codfw.wmnet with reason: Maintenance [09:54:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2113.codfw.wmnet with reason: Maintenance [09:55:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [09:55:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2113.codfw.wmnet with reason: Maintenance [09:58:42] jouncebot: nowandnext [09:58:42] No deployments scheduled for the next 1 hour(s) and 1 minute(s) [09:58:42] In 1 hour(s) and 1 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T1100) [09:58:50] (03PS2) 10Ladsgroup: Remove Flow as default in techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877244 [09:58:58] (03CR) 10Ladsgroup: [C: 03+2] Remove Flow as default in techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877244 (owner: 10Ladsgroup) [09:59:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877244 (owner: 10Ladsgroup) [09:59:41] (03Merged) 10jenkins-bot: Remove Flow as default in techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877244 (owner: 10Ladsgroup) [09:59:56] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:877244|Remove Flow as default in techconductwiki]] [10:01:36] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:877244|Remove Flow as default in techconductwiki]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:03:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-tool1010.eqiad.wmnet with OS bullseye [10:07:48] !log ladsgroup@deploy1002 
Finished scap: Backport for [[gerrit:877244|Remove Flow as default in techconductwiki]] (duration: 07m 51s) [10:12:51] (03PS9) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [10:12:55] 10SRE, 10Traffic, 10Traffic-Icebox, 10WMF-General-or-Unknown, and 2 others: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10Vgutierrez) since this bug was reported back in 2019, our CDN stack has changed a little b... [10:13:02] (03CR) 10Giuseppe Lavagetto: Start using the ClusterConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 (owner: 10Giuseppe Lavagetto) [10:16:33] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-tool1010.eqiad.wmnet with reason: host reimage [10:17:21] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10JEbe-WMF) [10:18:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-tool1010.eqiad.wmnet with reason: host reimage [10:21:02] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Clement_Goubert) [10:21:54] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert [10:23:14] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10JEbe-WMF) [10:28:24] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:54] (03CR) 
10Hnowlan: [C: 03+1] changeprop: add liftwing revscoring streams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [10:31:54] hnowlan: <3 [10:35:04] (03PS1) 10Btullis: Grant production shell access to Jennifer Ebe [puppet] - 10https://gerrit.wikimedia.org/r/882596 (https://phabricator.wikimedia.org/T327406) [10:37:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-tool1010.eqiad.wmnet with OS bullseye [10:37:35] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10Performance-Team (Radar): Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Theklan) @Legoktm could you help me with this at euwiki? Thanks! [10:39:42] !log btullis@deploy1002 Installing scap version "4.33.1" for 1 hosts [10:39:52] !log btullis@deploy1002 Installation of scap version "4.33.1" completed for 1 hosts [10:40:05] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@4ba1cb1]: (no justification provided) [10:40:24] !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@4ba1cb1]: (no justification provided) (duration: 00m 20s) [10:40:39] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@4ba1cb1]: (no justification provided) [10:40:44] !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@4ba1cb1]: (no justification provided) (duration: 00m 06s) [10:46:32] (03CR) 10Elukey: [C: 03+2] changeprop: add liftwing revscoring streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [10:48:02] !log rolling upgrade to HAProxy 2.4.20 on ulsfo [10:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:32] (03PS8) 10Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/881664 
(https://phabricator.wikimedia.org/T327302) [10:48:57] (03CR) 10Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [10:49:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: 10JHathaway) [10:49:37] 10SRE-tools, 10Infrastructure-Foundations, 10netops: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (10ayounsi) p:05Triage→03Low [10:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2113 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P43230 and previous config saved to /var/cache/conftool/dbconfig/20230123-104951-ladsgroup.json [10:50:40] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/882260 (https://phabricator.wikimedia.org/T327644) [10:52:05] (03PS1) 10Btullis: Enable the two new cache types in superset production [puppet] - 10https://gerrit.wikimedia.org/r/882599 (https://phabricator.wikimedia.org/T323458) [10:52:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:54:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T327644 [10:54:15] T327644: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T327644 [10:54:16] (03PS2) 10Btullis: Grant production shell access to Jennifer Ebe [puppet] - 10https://gerrit.wikimedia.org/r/882596 (https://phabricator.wikimedia.org/T327406) [10:54:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T327644 [10:54:41] (03CR) 10CI reject: [V: 04-1] Grant production shell access to Jennifer Ebe [puppet] - 10https://gerrit.wikimedia.org/r/882596 (https://phabricator.wikimedia.org/T327406) (owner: 10Btullis) [10:55:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2129 with weight 0 T327644', diff saved to https://phabricator.wikimedia.org/P43231 and previous config saved to /var/cache/conftool/dbconfig/20230123-105520-ladsgroup.json [10:55:37] !log update management routers ACLs to add new bast hosts [10:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:03] (03PS2) 10Jbond: prometheus: decode utf-8 in puppet agent script [puppet] - 10https://gerrit.wikimedia.org/r/879957 (owner: 10Majavah) [10:56:20] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/881869 (https://phabricator.wikimedia.org/T273509) (owner: 10Muehlenhoff) [10:56:27] (03PS3) 10Btullis: Grant production shell access to Jennifer Ebe [puppet] - 10https://gerrit.wikimedia.org/r/882596 (https://phabricator.wikimedia.org/T327406) [10:56:39] (03CR) 10Btullis: [C: 03+2] Enable the two new cache types in superset production [puppet] - 10https://gerrit.wikimedia.org/r/882599 (https://phabricator.wikimedia.org/T323458) (owner: 10Btullis) [10:56:44] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/881837 (https://phabricator.wikimedia.org/T273509) (owner: 10Muehlenhoff) [10:57:55] (03CR) 10Jbond: [C: 03+2] "lgtm will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/879957 (owner: 10Majavah) [10:57:58] 10SRE-tools, 10Infrastructure-Foundations, 10netops: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (10Volans) This is a draft of a possible one-off script that can be run within homer's venv to gather the FQDNs to test, attempt a connection and grab the fingerpri... [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T1100) [11:01:33] (03CR) 10Btullis: [C: 03+2] Grant production shell access to Jennifer Ebe [puppet] - 10https://gerrit.wikimedia.org/r/882596 (https://phabricator.wikimedia.org/T327406) (owner: 10Btullis) [11:01:42] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, 10GitLab (Auth & Access): migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) fyi we now have OIDC support in production, currently been tested by @SLyngshede-WMF [11:01:54] (03CR) 10Muehlenhoff: [C: 03+2] Move ping offload from ping2002 to ping2003 in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/881837 (https://phabricator.wikimedia.org/T273509) (owner: 10Muehlenhoff) [11:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2113 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P43232 and previous config saved to /var/cache/conftool/dbconfig/20230123-110456-ladsgroup.json [11:07:45] (03CR) 10Jbond: [C: 03+1] "lgtm ping me on irc (after lunch as catching up on things) and i can deploy" [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [11:07:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for 
Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Clement_Goubert) [] Merge access grant [] Create kerberos principal [11:08:50] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/882600 (https://phabricator.wikimedia.org/T327187) (owner: 10Clément Goubert) [11:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:11:36] (03CR) 10Jbond: hieradata: add wmcs-roots to clouddumps servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879274 (owner: 10Majavah) [11:11:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2107.codfw.wmnet with reason: Maintenance [11:11:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2107.codfw.wmnet with reason: Maintenance [11:11:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2107 (T323827)', diff saved to https://phabricator.wikimedia.org/P43233 and previous config saved to /var/cache/conftool/dbconfig/20230123-111147-ladsgroup.json [11:11:51] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [11:12:20] PROBLEM - cassandra-a SSL 10.192.32.101:7001 on sessionstore2002 is CRITICAL: SSL CRITICAL - Certificate sessionstore2002-a valid until 2023-02-22 11:12:16 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:12:24] PROBLEM - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is CRITICAL: SSL CRITICAL - Certificate sessionstore2001-a valid until 2023-02-22 11:12:13 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:12:40] PROBLEM - cassandra-a SSL 10.64.32.85:7001 
on sessionstore1002 is CRITICAL: SSL CRITICAL - Certificate sessionstore1002-a valid until 2023-02-22 11:12:08 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:12:50] PROBLEM - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is CRITICAL: SSL CRITICAL - Certificate sessionstore1003-a valid until 2023-02-22 11:12:10 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:13:36] PROBLEM - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is CRITICAL: SSL CRITICAL - Certificate sessionstore1001-a valid until 2023-02-22 11:12:05 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:13:48] PROBLEM - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is CRITICAL: SSL CRITICAL - Certificate sessionstore2003-a valid until 2023-02-22 11:12:18 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [11:15:12] (03PS1) 10Vgutierrez: acme-chief: Restrict challenge type to valid ones [puppet] - 10https://gerrit.wikimedia.org/r/882602 (https://phabricator.wikimedia.org/T326942) [11:16:21] 10SRE, 10SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert Hi @Kappakayala, Please read and sign the [[ https://phabricator.wikimedia.org/L3 | Acknowledgement of Wikimed... 
[11:16:38] 10SRE, 10SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (10Clement_Goubert) [11:16:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39202/console" [puppet] - 10https://gerrit.wikimedia.org/r/882602 (https://phabricator.wikimedia.org/T326942) (owner: 10Vgutierrez) [11:17:47] !log Starting s6 codfw failover from db2114 to db2129 - T327644 [11:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:51] T327644: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T327644 [11:18:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2129 to s6 primary T327644', diff saved to https://phabricator.wikimedia.org/P43234 and previous config saved to /var/cache/conftool/dbconfig/20230123-111813-ladsgroup.json [11:18:30] (03PS2) 10Ladsgroup: mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/882260 (https://phabricator.wikimedia.org/T327644) (owner: 10Gerrit maintenance bot) [11:18:42] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/882260 (https://phabricator.wikimedia.org/T327644) (owner: 10Gerrit maintenance bot) [11:19:14] (03PS1) 10Cathal Mooney: Remove atlas-ulsfo from cr-border-in.pol as it's not live [homer/public] - 10https://gerrit.wikimedia.org/r/882605 [11:19:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/882602 (https://phabricator.wikimedia.org/T326942) (owner: 10Vgutierrez) [11:20:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2113 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P43235 and previous config saved to /var/cache/conftool/dbconfig/20230123-112001-ladsgroup.json [11:21:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2114 T327644', 
diff saved to https://phabricator.wikimedia.org/P43236 and previous config saved to /var/cache/conftool/dbconfig/20230123-112134-ladsgroup.json [11:22:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [11:22:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [11:22:38] (03PS2) 10Cathal Mooney: Remove atlas-ulsfo from cr-border-in.pol as it's not live [homer/public] - 10https://gerrit.wikimedia.org/r/882605 [11:23:48] (03PS3) 10Cathal Mooney: Remove atlas-ulsfo from cr-border-in.pol as it's not live [homer/public] - 10https://gerrit.wikimedia.org/r/882605 [11:24:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/882602 (https://phabricator.wikimedia.org/T326942) (owner: 10Vgutierrez) [11:24:59] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] acme-chief: Restrict challenge type to valid ones [puppet] - 10https://gerrit.wikimedia.org/r/882602 (https://phabricator.wikimedia.org/T326942) (owner: 10Vgutierrez) [11:27:35] (03PS1) 10Giuseppe Lavagetto: flink-app: use proper json [deployment-charts] - 10https://gerrit.wikimedia.org/r/882612 [11:28:14] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10BTullis) I have merged the changes to `data.yaml` so Jennifer should now have production shell access and access to the... 
[11:28:18] (03PS1) 10Clément Goubert: admin: Grant Muhammad Jaziraly access to analytics data [puppet] - 10https://gerrit.wikimedia.org/r/882613 (https://phabricator.wikimedia.org/T327172) [11:28:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert [11:29:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Clement_Goubert) [] OOB SSH key validation [] Merge access grant [] Create kerberos principal [11:31:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [11:31:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [11:32:50] (03CR) 10Michael Große: [C: 04-1] "This change should only be deployed after it was greenlit by the wmde-internal stage-gate meeting (scheduled on Tuesday)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882615 (https://phabricator.wikimedia.org/T324999) (owner: 10Michael Große) [11:33:34] 10SRE, 10SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (10Clement_Goubert) 05In progress→03Resolved [11:33:59] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10BTullis) The work is completed. I'll work with @JEbe-WMF to verify access. 
[11:34:21] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Requesting access to Data Engineering team resources for Jennifer Ebe - https://phabricator.wikimedia.org/T327406 (10BTullis) 05Open→03Resolved p:05Triage→03Medium [11:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2113 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P43239 and previous config saved to /var/cache/conftool/dbconfig/20230123-113506-ladsgroup.json [11:35:36] ACKNOWLEDGEMENT - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service MVernon https://phabricator.wikimedia.org/T327253 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/881598 (owner: 10Muehlenhoff) [11:37:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert [11:37:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (10Clement_Goubert) @odimitrijevic @Ottomata Can I get your approval on this please? [11:40:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10Clement_Goubert) @HXi-WMF, could you please confirm that we can proceed with the account renaming? 
[11:45:49] (03CR) 10Ayounsi: [C: 03+1] Remove atlas-ulsfo from cr-border-in.pol as it's not live [homer/public] - 10https://gerrit.wikimedia.org/r/882605 (owner: 10Cathal Mooney) [11:47:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881649 (https://phabricator.wikimedia.org/T327408) (owner: 10Volans) [11:47:10] (03CR) 10Cathal Mooney: [C: 03+2] Remove atlas-ulsfo from cr-border-in.pol as it's not live [homer/public] - 10https://gerrit.wikimedia.org/r/882605 (owner: 10Cathal Mooney) [11:47:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/881650 (owner: 10Volans) [11:48:18] (03Merged) 10jenkins-bot: Remove atlas-ulsfo from cr-border-in.pol as it's not live [homer/public] - 10https://gerrit.wikimedia.org/r/882605 (owner: 10Cathal Mooney) [11:48:57] (03CR) 10Clément Goubert: admin/canary_appserver: add group of users allowed to disable puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [11:50:50] (03CR) 10ArielGlenn: [C: 03+1] "LGTM but per irc conversation WMCS folks should really give the thumbs up" [puppet] - 10https://gerrit.wikimedia.org/r/881386 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:51:06] (03CR) 10ArielGlenn: [C: 03+1] "LGTM but per irc conversation WMCS folks should really give the thumbs up" [puppet] - 10https://gerrit.wikimedia.org/r/881393 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:51:21] (03CR) 10ArielGlenn: [C: 03+1] "LGTM but per irc conversation WMCS folks should really give the thumbs up" [puppet] - 10https://gerrit.wikimedia.org/r/881399 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:51:40] (03CR) 10ArielGlenn: [C: 03+1] "LGTM but per irc conversation WMCS folks should really give the thumbs up" [puppet] - 10https://gerrit.wikimedia.org/r/881393 (https://phabricator.wikimedia.org/T135991) (owner: 
10Muehlenhoff) [11:51:57] (03CR) 10ArielGlenn: [C: 03+1] "LGTM but per irc conversation WMCS folks should really give the thumbs up" [puppet] - 10https://gerrit.wikimedia.org/r/881413 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:52:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868703 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:55:35] 10SRE, 10LDAP-Access-Requests: Grant Access to Wmf group for MShilova - https://phabricator.wikimedia.org/T327546 (10Clement_Goubert) 05Open→03In progress a:03Clement_Goubert [11:56:10] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) The timer job ran this morning, with our less picky settings, and ended thus: ` [...] Jan 23 10:24:35 ms-be1069 swift-rclone-sync[1539164]: ERROR : wikipedia-de-local-public.... [11:57:06] !log Reboot db2132 (m1 codfw master) [11:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:29] (03CR) 10Jbond: Fix xihua's account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [11:57:49] !log dbmaint Reboot db2132 (m1 codfw master) [11:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:22] !log dbmaint Reboot db2133 (m2 codfw master) [11:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:33] (03CR) 10Majavah: hieradata: add wmcs-roots to clouddumps servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879274 (owner: 10Majavah) [12:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T323827)', diff saved to https://phabricator.wikimedia.org/P43241 and previous config saved to /var/cache/conftool/dbconfig/20230123-120012-ladsgroup.json [12:00:16] T323827: Finish timestamp schema changes in 
flaggedrevs - https://phabricator.wikimedia.org/T323827 [12:03:15] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:03:24] (03CR) 10Jbond: openstack: encapi: create parent directories for files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881711 (owner: 10Majavah) [12:04:15] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:05:08] !log removing /usr/local/bin/prometheus-puppet-agent-stats from prometheus crontab on snapshot1014 [12:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:21] (03CR) 10Clément Goubert: Fix xihua's account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [12:06:39] !log dbmaint Reboot db2134 (m3 codfw master) [12:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:42] !log dbmaint Reboot db2135 (m5 codfw master) [12:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:27] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:08:05] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:08:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:08:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:10:08] (03CR) 10Clément Goubert: Fix xihua's account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) 
[12:10:09] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:11:18] (03PS2) 10Clément Goubert: Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [12:11:44] (03CR) 10Jbond: P:gitlab: manage gitlab with gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/684487 (owner: 10Jbond) [12:11:45] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:11:52] (03Abandoned) 10Jbond: P:gitlab: manage gitlab with gitlab module [puppet] - 10https://gerrit.wikimedia.org/r/684487 (owner: 10Jbond) [12:12:02] (03CR) 10CI reject: [V: 04-1] Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [12:15:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P43242 and previous config saved to /var/cache/conftool/dbconfig/20230123-121519-ladsgroup.json [12:22:25] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:22:31] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:22:37] (03PS5) 10Daniel Kinzler: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 (https://phabricator.wikimedia.org/T320534) [12:22:59] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:22:59] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 
(10MatthewVernon) Picking one of those to go log-diving (via the hacky `sudo cumin O:swift::proxy 'grep Symbol_Limes.png /var/log/swift/proxy-access.log || true'`) gets 3 hits, one of which is... [12:23:33] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:24:01] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:24:07] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:24:07] (03PS1) 10Clément Goubert: admin: Add Mariya Shilova to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/882644 (https://phabricator.wikimedia.org/T327546) [12:24:35] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:25:09] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P43245 and previous config saved to /var/cache/conftool/dbconfig/20230123-123025-ladsgroup.json [12:31:09] 10SRE, 10Acme-chief, 10Traffic: Ci check for acme-chief changes - https://phabricator.wikimedia.org/T326942 (10Vgutierrez) p:05Triage→03Low this has been mitigated by https://gerrit.wikimedia.org/r/882602, invalid challenge types will now trigger a puppet compilation failure [12:36:07] (03CR) 10Jbond: [C: 03+1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [12:38:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Clement_Goubert) 
p:05Triage→03Medium [12:38:41] 10SRE, 10SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (10Clement_Goubert) p:05Triage→03Medium [12:38:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Clement_Goubert) p:05Triage→03Medium [12:39:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (10Clement_Goubert) p:05Triage→03Medium [12:41:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to Wmf group for MShilova - https://phabricator.wikimedia.org/T327546 (10Clement_Goubert) p:05Triage→03Medium [] Merge CR [] Grant LDAP group access [12:43:10] 10SRE, 10SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (10Clement_Goubert) 05Resolved→03In progress [12:43:38] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/877964 (owner: 10Majavah) [12:45:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T323827)', diff saved to https://phabricator.wikimedia.org/P43246 and previous config saved to /var/cache/conftool/dbconfig/20230123-124532-ladsgroup.json [12:45:36] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [12:45:47] (03CR) 10Hnowlan: [C: 03+1] "lgtm, one query" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [12:50:42] (03CR) 10Jelto: [C: 03+2] gitlab: stop using "latest" backup name [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:51:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/882600 (https://phabricator.wikimedia.org/T327187) 
(owner: 10Clément Goubert) [12:51:53] (03PS2) 10Majavah: ldap: move ssh-key-ldap-lookup directly to ssh module [puppet] - 10https://gerrit.wikimedia.org/r/877964 [12:52:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/882613 (https://phabricator.wikimedia.org/T327172) (owner: 10Clément Goubert) [12:53:00] (03CR) 10Majavah: ldap: move ssh-key-ldap-lookup directly to ssh module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877964 (owner: 10Majavah) [12:53:17] (03CR) 10CI reject: [V: 04-1] ldap: move ssh-key-ldap-lookup directly to ssh module [puppet] - 10https://gerrit.wikimedia.org/r/877964 (owner: 10Majavah) [12:53:27] 10SRE, 10observability, 10Performance-Team (Radar): Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105 (10Clement_Goubert) [12:53:29] 10SRE: Update Media dashboard in Grafana to use Prometheus metrics - https://phabricator.wikimedia.org/T193445 (10Clement_Goubert) 05Open→03Invalid The link in the task description 404s. Being bold and closing as Invalid, feel free to reopen with up to date information if needed. 
[12:55:46] (03CR) 10Jbond: [C: 04-1] "i have added this as approval in the next IF meeting (today) will update after the meeting" [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [12:55:48] (03CR) 10Clément Goubert: [C: 03+2] admin: Grant ollieshotton access to analytics data [puppet] - 10https://gerrit.wikimedia.org/r/882600 (https://phabricator.wikimedia.org/T327187) (owner: 10Clément Goubert) [12:56:38] (03PS3) 10Majavah: ldap: move ssh-key-ldap-lookup directly to ssh module [puppet] - 10https://gerrit.wikimedia.org/r/877964 [12:58:17] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39204/console" [puppet] - 10https://gerrit.wikimedia.org/r/877964 (owner: 10Majavah) [12:59:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Clement_Goubert) 05In progress→03Resolved @Ollie.Shotton_WMDE your access to the relevant groups has been granted. Please wait 30m (as of this comment) be... 
[13:02:03] (03PS2) 10Clément Goubert: admin: Grant Muhammad Jaziraly access to analytics data [puppet] - 10https://gerrit.wikimedia.org/r/882613 (https://phabricator.wikimedia.org/T327172) [13:03:39] (03CR) 10Jbond: Fix xihua's account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [13:04:52] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/882644 (https://phabricator.wikimedia.org/T327546) (owner: 10Clément Goubert) [13:04:56] (03PS2) 10Clément Goubert: admin: Add Mariya Shilova to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/882644 (https://phabricator.wikimedia.org/T327546) [13:06:34] (03CR) 10Brian Wolff: Force users with passwords shorter than 8 characters to change it (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882232 (https://phabricator.wikimedia.org/T285151) (owner: 10Zabe) [13:06:56] (03CR) 10Clément Goubert: [C: 03+2] admin: Add Mariya Shilova to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/882644 (https://phabricator.wikimedia.org/T327546) (owner: 10Clément Goubert) [13:08:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39207/console" [puppet] - 10https://gerrit.wikimedia.org/r/877964 (owner: 10Majavah) [13:09:43] (03CR) 10Jbond: [C: 03+2] "LGTM will merge thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/877964 (owner: 10Majavah) [13:14:55] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/860837/39206/" [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:15:50] (03PS3) 10Clément Goubert: admin: Grant Muhammad Jaziraly access to analytics data [puppet] - 10https://gerrit.wikimedia.org/r/882613 (https://phabricator.wikimedia.org/T327172) [13:16:17] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant 
Access to Wmf group for MShilova - https://phabricator.wikimedia.org/T327546 (10Clement_Goubert) 05In progress→03Resolved @MShilova_WMF your access to the wmf group has been granted. Please wait 30m (as of this comment) before trying it out as the... [13:16:24] (03PS4) 10Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230) [13:16:43] (03CR) 10CI reject: [V: 04-1] profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230) (owner: 10Jbond) [13:17:25] (03CR) 10Jbond: profile::performance: add a new profile for tweaking sysctl parameters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230) (owner: 10Jbond) [13:18:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Clement_Goubert) You should have received an email regarding Kerberos, you can follow the instructions on there to set your credentials. If you didn't, please... [13:19:30] 10SRE, 10Traffic-Icebox, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10jbond) @BCornwall thanks for reviving this. i think that this ultimately stalled as there was a question of whether it would be useful. from...
[13:20:17] (03CR) 10Clément Goubert: [C: 03+2] admin: Grant Muhammad Jaziraly access to analytics data [puppet] - 10https://gerrit.wikimedia.org/r/882613 (https://phabricator.wikimedia.org/T327172) (owner: 10Clément Goubert) [13:22:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Clement_Goubert) 05In progress→03Resolved @Muhammad_Yasser_Jazirahly_WMDE your access to the relevant groups has been granted. Please wait 30m (as of... [13:23:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Muhammad_Yasser_Jazirahly_WMDE) Many thanks @Clement_Goubert [13:28:23] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:46] 10SRE, 10Acme-chief, 10Traffic: Ci check for acme-chief changes - https://phabricator.wikimedia.org/T326942 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [13:40:07] (03CR) 10Ottomata: [C: 03+2] flink-app: use proper json [deployment-charts] - 10https://gerrit.wikimedia.org/r/882612 (owner: 10Giuseppe Lavagetto) [13:45:13] (03Merged) 10jenkins-bot: flink-app: use proper json [deployment-charts] - 10https://gerrit.wikimedia.org/r/882612 (owner: 10Giuseppe Lavagetto) [13:57:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T1400) [14:00:05] sbailey, cirno, and Func: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:18] I am here [14:00:42] here [14:01:24] Hello, I would like to ask where to see the deployment status for the CX Server https://cxserver.wikimedia.org ( https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/cxserver ) for the bug fix https://gerrit.wikimedia.org/r/c/882173 ( https://phabricator.wikimedia.org/T129470 ). Thanks. [14:01:46] (03PS1) 10Jbond: admin: data_tests improve error messages and correct typos [puppet] - 10https://gerrit.wikimedia.org/r/882648 [14:02:11] (03PS1) 10Ottomata: Add to admin_ng/README.md on how to deploy limiting the release [deployment-charts] - 10https://gerrit.wikimedia.org/r/882649 [14:03:30] (03CR) 10Jbond: [C: 03+2] admin: data_tests improve error messages and correct typos [puppet] - 10https://gerrit.wikimedia.org/r/882648 (owner: 10Jbond) [14:05:12] (03PS3) 10Jbond: Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:05:50] (03CR) 10CI reject: [V: 04-1] Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:07:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:39] (03PS4) 10Jbond: Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:10:11] (03CR) 10CI reject: [V: 
04-1] Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:10:13] I can deploy in a few minutes [14:10:30] 10-4 [14:10:35] ;-) [14:12:27] Winston_Sung[m]: operations/deployment-charts.git [14:12:58] (03CR) 10Majavah: [C: 03+2] SpecialUserrights: Allow updating the expiry of user groups [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/882179 (https://phabricator.wikimedia.org/T327605) (owner: 10Func) [14:14:25] taavi: Thanks. [14:14:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882422 (https://phabricator.wikimedia.org/T323131) (owner: 10Stang) [14:14:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882425 (https://phabricator.wikimedia.org/T327380) (owner: 10Stang) [14:14:40] (03CR) 10Elukey: [C: 03+2] helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [14:14:49] (03PS2) 10Majavah: shnwikibooks: Add project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882425 (https://phabricator.wikimedia.org/T327380) (owner: 10Stang) [14:14:53] (03CR) 10Majavah: [C: 03+2] shnwikibooks: Add project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882425 (https://phabricator.wikimedia.org/T327380) (owner: 10Stang) [14:15:18] (03Merged) 10jenkins-bot: bnwikiquote: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882422 (https://phabricator.wikimedia.org/T323131) (owner: 10Stang) [14:15:39] (03Merged) 10jenkins-bot: shnwikibooks: Add project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882425 (https://phabricator.wikimedia.org/T327380) (owner: 10Stang) [14:15:57] sbailey: was 880989 tested 
in beta or smaller wikis before being rolled out globally? [14:16:04] yes [14:16:17] !log taavi@deploy1002 Started scap: Backport for [[gerrit:882422|bnwikiquote: Update logo (T323131)]], [[gerrit:882425|shnwikibooks: Add project logo (T327380)]] [14:16:24] T323131: New localized logo for bn.wikquote - https://phabricator.wikimedia.org/T323131 [14:16:24] T327380: Change Logo on shn.wikibooks.org - https://phabricator.wikimedia.org/T327380 [14:17:01] (03CR) 10Alexandros Kosiaris: Fix xihua's account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:17:56] !log taavi@deploy1002 taavi and stang: Backport for [[gerrit:882422|bnwikiquote: Update logo (T323131)]], [[gerrit:882425|shnwikibooks: Add project logo (T327380)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:18:06] cirno: please test the logo patches [14:18:10] looking [14:18:55] !log mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=zhwiki pageassessments # T326387 [14:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:58] T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387 [14:19:35] taavi, both two looks good to me [14:19:42] thanks, syncing [14:19:46] *look [14:20:05] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:20:49] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:22:10] Is there any scheduled time to update cxserver or it depends on request? 
[14:23:51] (03PS5) 10Alexandros Kosiaris: Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) [14:24:28] (03PS4) 10Majavah: zhwiki: Install PageAssessments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876196 (https://phabricator.wikimedia.org/T326387) (owner: 10Stang) [14:24:30] (03PS6) 10Jbond: Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:24:32] (03PS1) 10Jbond: admin: Add check for duplicate uid's [puppet] - 10https://gerrit.wikimedia.org/r/882652 [14:24:34] (03CR) 10Majavah: [C: 03+2] zhwiki: Install PageAssessments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876196 (https://phabricator.wikimedia.org/T326387) (owner: 10Stang) [14:24:53] (03CR) 10Muehlenhoff: [C: 03+2] Move ping offload from ping1002 to ping1003 in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/881869 (https://phabricator.wikimedia.org/T273509) (owner: 10Muehlenhoff) [14:25:10] (03CR) 10CI reject: [V: 04-1] Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:25:12] (03CR) 10CI reject: [V: 04-1] admin: Add check for duplicate uid's [puppet] - 10https://gerrit.wikimedia.org/r/882652 (owner: 10Jbond) [14:25:16] (03Merged) 10jenkins-bot: zhwiki: Install PageAssessments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876196 (https://phabricator.wikimedia.org/T326387) (owner: 10Stang) [14:25:23] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [14:25:34] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [14:25:39] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:882422|bnwikiquote: Update logo (T323131)]], [[gerrit:882425|shnwikibooks: Add project logo (T327380)]] (duration: 09m 22s) [14:25:44] T323131: New localized logo for 
bn.wikquote - https://phabricator.wikimedia.org/T323131 [14:25:44] T327380: Change Logo on shn.wikibooks.org - https://phabricator.wikimedia.org/T327380 [14:25:58] !log taavi@deploy1002 Started scap: Backport for [[gerrit:876196|zhwiki: Install PageAssessments (T326387)]] [14:26:01] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [14:26:01] T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387 [14:26:09] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [14:27:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Interestingly, I can not re-use the same uid (which actually makes sense) but also the fact that hpham and phamhi (both absented) have uid" [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:27:38] !log taavi@deploy1002 stang and taavi: Backport for [[gerrit:876196|zhwiki: Install PageAssessments (T326387)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:27:50] cirno: please test the pageassessments one [14:27:58] looking [14:30:16] (03Merged) 10jenkins-bot: SpecialUserrights: Allow updating the expiry of user groups [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/882179 (https://phabricator.wikimedia.org/T327605) (owner: 10Func) [14:31:12] taavi, the magic word "{{#assessment}}" starts working, and special page Special:PageAssessments exist, so LGTM [14:31:32] thanks, syncing [14:32:27] taavi, could you please flush the caches of two logos? thanks [14:32:49] oh right, good point. 
give me a second [14:33:00] (03CR) 10Hashar: [C: 04-1] "Looks good, there is two minor issues though:" [puppet] - 10https://gerrit.wikimedia.org/r/860837 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:33:15] oops, only bnwikiquote is needed [14:34:01] {{done}} [14:36:03] Is 880989 getting deployed? it has been in beta for over a month? [14:36:43] sbailey: yes, I'm dealing with other patches at the moment, yours is still in the queue [14:36:57] (03PS1) 10Elukey: changeprop: fix uri in liftwing's template [deployment-charts] - 10https://gerrit.wikimedia.org/r/882654 (https://phabricator.wikimedia.org/T327302) [14:37:08] thx, new to backport process [14:37:22] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:876196|zhwiki: Install PageAssessments (T326387)]] (duration: 11m 24s) [14:37:26] T326387: Deploy PageAssessments to Chinese Wikipedia - https://phabricator.wikimedia.org/T326387 [14:37:36] Func: yours is up next [14:37:45] :) [14:37:49] ok [14:37:52] !log taavi@deploy1002 Started scap: Backport for [[gerrit:882179|SpecialUserrights: Allow updating the expiry of user groups (T327605)]] [14:37:55] T327605: Special:UserRights: changing an already set permission's expiry to any new value fails - https://phabricator.wikimedia.org/T327605 [14:39:30] !log taavi@deploy1002 taavi and func: Backport for [[gerrit:882179|SpecialUserrights: Allow updating the expiry of user groups (T327605)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:39:39] sbailey: ah, sorry. we deployers generally change the order to be as quick as possible. I had to ask a question about 882179 so I couldn't start with it, and I had already +2'd F.unc's patch to save time on the core CI and it had merged in the meantime so I need to do that before I can get to yours [14:39:51] Func: can you test yours on a mwdebug server please?
[14:39:58] I don't have sufficient rights to test on prod, but this simple patch should just work. [14:40:25] sbailey: in the meantime: do you have the x-wikimedia-debug extension installed? [14:40:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! And TIL :)" [homer/public] - 10https://gerrit.wikimedia.org/r/877202 (https://phabricator.wikimedia.org/T325806) (owner: 10Ayounsi) [14:41:04] thanks for the explanation, very appreciative of your comments. Happy to watch. Have more patches to backport in the coming weeks that are trickier, such as two data migration patches [14:41:21] Func: ack. I gave it a quick test on testwiki just in case to not break stuff, works fine so deploying. [14:42:16] the x-wikimedia-debug extension will not help me with this patch being verified. I need to use Quarry and actually look at the error log and create pages with lint errors and see them show up in reports. [14:42:20] !log rolling out pybal 1.15.10: T321191 [14:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:24] T321191: Cleanup pybal Prometheus metrics on monitor stop() - https://phabricator.wikimedia.org/T321191 [14:43:00] sbailey: hmm. how/when are the rows inserted into the databases? [14:43:35] taavi, pretty fast, but as part of a job that is invoked by VE and standard editor [14:43:54] Linter recordLintJob [14:43:56] ah, it's a job?
yeah, it can't be tested with x-wm-d then :/ [14:44:05] I know, annoying [14:44:21] oh well [14:45:18] part of the reparsing code path of parsoid [14:46:39] parsoid queues up a bunch of linter error records when it reparses a page, then through a hook the job runs usually pretty quickly [14:46:41] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:882179|SpecialUserrights: Allow updating the expiry of user groups (T327605)]] (duration: 08m 48s) [14:46:45] T327605: Special:UserRights: changing an already set permission's expiry to any new value fails - https://phabricator.wikimedia.org/T327605 [14:47:05] in that case in the future please split the changes to multiple patches (for example group0 first, then group1 and finally all wikis) since that creates a much smaller blast radius if something goes wrong. I can do it this way this time, but for the future that's much easier to deploy [14:47:16] (03PS2) 10Jbond: admin: Add check for duplicate uid's [puppet] - 10https://gerrit.wikimedia.org/r/882652 [14:47:19] (03PS5) 10Majavah: Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:47:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:47:40] (03PS7) 10Jbond: Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:47:49] this is a very safe change, if it were more dangerous I would have done more stages [14:48:20] (03CR) 10CI reject: [V: 04-1] Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:48:56] (03CR) 10Jbond: "ready for review" [puppet] - 
10https://gerrit.wikimedia.org/r/882652 (owner: 10Jbond) [14:49:55] (03CR) 10Vgutierrez: [C: 03+1] Release 9.1.4-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563) (owner: 10Ssingh) [14:50:59] (03CR) 10Majavah: [C: 03+2] Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:51:09] (03CR) 10Ssingh: [C: 03+2] Release 9.1.4-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563) (owner: 10Ssingh) [14:51:27] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mlitn-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:56] (03Merged) 10jenkins-bot: Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:52:09] :-) [14:52:11] !log taavi@deploy1002 Started scap: Backport for [[gerrit:880989|Enable Linter write namespace tag and template using core config (T299612)]] [14:52:15] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [14:52:32] Testing [14:53:04] testing what exactly? the patch is still not deployed anywhere [14:53:19] ? 
[14:53:32] Ah sync [14:53:44] yeah, it takes a while these days [14:53:44] (03CR) 10Elukey: [C: 03+2] changeprop: fix uri in liftwing's template [deployment-charts] - 10https://gerrit.wikimedia.org/r/882654 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [14:53:47] !log taavi@deploy1002 taavi and sbailey: Backport for [[gerrit:880989|Enable Linter write namespace tag and template using core config (T299612)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:55:07] (03PS1) 10Hashar: puppet_compiler: serve pson.gz as application/json [puppet] - 10https://gerrit.wikimedia.org/r/882656 [14:56:09] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/882656 (owner: 10Hashar) [14:56:35] (03CR) 10Vgutierrez: [C: 03+1] "looking good, please fix the mentioned typo on the changelog" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [14:56:59] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, 10decommission-hardware: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10Volans) I've set the device back to active to reflect its current status and prevent some warnings to show up in the `sre.dns.netbox` cookbook runs. [14:57:32] So, is there any scheduled time to update the CX Server or it is required to fill a request somewhere? 
[14:57:55] (03PS2) 10Ssingh: Release 6.0.11-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) [14:58:07] (03CR) 10Ssingh: Release 6.0.11-1wm1 (031 comment) [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [14:58:15] Winston_Sung[m]: if there was a scheduled time it would be listed on https://wikitech.wikimedia.org/wiki/Deployments, and if there is not you need to ask the cxserver maintainers somewhere else [14:58:37] (03PS8) 10Jbond: Fix xihua's account [puppet] - 10https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: 10Alexandros Kosiaris) [14:58:49] (03CR) 10Ladsgroup: "I'm too late for this now but for future cases, please enable it on a set of test wikis and then make sure it doesn't break anything and t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [14:58:57] Ok, thanks for the response. [14:59:27] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [14:59:34] PROBLEM - MariaDB Replica SQL: s2 #page on db1105 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Error Unknown column linter_template in field list on query. Default database: nlwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:59:38] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [14:59:40] checking [14:59:43] Amir1: ^ [14:59:50] not me [14:59:54] let me depool [14:59:59] sigh, that looks very related to the current deployment [15:00:02] should I revert? 
[15:00:08] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:880989|Enable Linter write namespace tag and template using core config (T299612)]] (duration: 07m 56s) [15:00:10] taavi: very likely [15:00:11] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [15:00:13] please revert [15:00:14] PROBLEM - MariaDB Replica SQL: s7 #page on db1170 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Error Unknown column linter_template in field list on query. Default database: metawiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:00:16] sure, doing [15:00:19] sorry :/ [15:00:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312', diff saved to https://phabricator.wikimedia.org/P43247 and previous config saved to /var/cache/conftool/dbconfig/20230123-150018-marostegui.json [15:00:20] Amir1, the write code was running on Beta since mid december 880989 [15:00:21] taavi: revert [15:00:21] <_joe_> taavi: revert, yes [15:00:27] (03PS1) 10TrainBranchBot: Revert "Enable Linter write namespace tag and template using core config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882661 [15:00:29] (03CR) 10TrainBranchBot: "taavi@deploy1002 created a revert of this change as I76ef30bfd05fe069b2715e1933e8b81723149187" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [15:00:29] doing [15:00:33] sbailey: beta and production dbs are different [15:00:35] maybe those hosts didn't get the column? 
[15:00:39] beta works with update.php [15:00:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882661 (owner: 10TrainBranchBot) [15:00:44] hey [15:00:46] marostegui: yeah, that's my guess [15:00:52] <_joe_> let's wait to talk about what went wrong until things are stable [15:00:54] I can add it quickly [15:00:55] (03PS2) 10Ottomata: Add to admin_ng/README.md on how to deploy limiting the release [deployment-charts] - 10https://gerrit.wikimedia.org/r/882649 [15:00:57] (03PS1) 10Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - 10https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) [15:01:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1170:3317', diff saved to https://phabricator.wikimedia.org/P43248 and previous config saved to /var/cache/conftool/dbconfig/20230123-150110-marostegui.json [15:01:18] sounds like it's handled! [15:01:18] both hosts are now depooled [15:01:27] <_joe_> bblack: it's ongoing [15:01:31] sorry about this [15:01:50] <_joe_> taavi: are you taking care of the rollback? [15:01:55] (03Merged) 10jenkins-bot: Revert "Enable Linter write namespace tag and template using core config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882661 (owner: 10TrainBranchBot) [15:01:57] yes, I am rolling the mediawiki changes back [15:02:10] !log taavi@deploy1002 Started scap: Backport for [[gerrit:882661|Revert "Enable Linter write namespace tag and template using core config"]] [15:02:15] yup, it's the linter error [15:02:15] thanks for confirming [15:02:16] <_joe_> ok, thanks [15:02:23] Last_Error: Error 'Unknown column 'linter_template' in 'field list'' on query. Default database: 'metawiki'. Query: 'INSERT /* MediaWiki\Linter\Database::setForPage */ IGNORE INTO `> [15:02:29] yeah the column isn't present [15:02:33] new column didn't exist in prod dbs yet? 
[15:02:33] I am going to add them on db1105 and db1170 [15:02:35] gradual rollout people, please [15:02:36] ok [15:02:45] the hosts are not serving traffic now [15:02:48] so we should be good [15:02:53] yeah [15:02:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." [puppet] - 10https://gerrit.wikimedia.org/r/882652 (owner: 10Jbond) [15:03:02] (03CR) 10Jbond: admin/canary_appserver: add group of users allowed to disable puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [15:03:02] I will add it and let you know taavi [15:03:29] marostegui: I'll revert it anyways, it can be re-enabled at some later window [15:03:34] taavi: sounds good [15:03:48] !log taavi@deploy1002 taavi and trainbranchbot: Backport for [[gerrit:882661|Revert "Enable Linter write namespace tag and template using core config"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [15:04:04] <_joe_> taavi: I'd release everywhere tbh [15:04:08] (03PS2) 10Hashar: puppet_compiler: serve pson.gz as application/json [puppet] - 10https://gerrit.wikimedia.org/r/882656 [15:04:22] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/882656 (owner: 10Hashar) [15:06:14] taavi: now that it's reverted, please do gradual roll out, first testwikis, then one section, etc. [15:06:37] sbailey: ^ [15:07:08] <_joe_> can we claim the incident is over? [15:07:16] Ok, how do I verify all databases have had the 3 columns added? 
[15:07:51] Yes will figure out how to do more gradual roll out [15:07:51] it's not possible manually, you can do a drift report [15:08:09] https://drift-tracker.toolforge.org/report/core/ [15:08:16] db1105:3312 is now fixed [15:08:19] https://drift-tracker.toolforge.org/report/flaggedrevs/ [15:08:20] I am fixing db1170:3317 [15:09:12] RECOVERY - MariaDB Replica SQL: s2 #page on db1105 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:09:39] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:882661|Revert "Enable Linter write namespace tag and template using core config"]] (duration: 07m 28s) [15:09:44] revert was finally synced [15:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:11:18] (CR) Hashar: "The compile has been triggered for `pcc-worker1001.puppet-diffs.eqiad1.wikimedia.cloud` which is a noop https://puppet-compiler.wmflabs.or" [puppet] - https://gerrit.wikimedia.org/r/882656 (owner: Hashar) [15:11:40] (PS3) Hashar: puppet_compiler: serve pson.gz as application/json [puppet] - https://gerrit.wikimedia.org/r/882656 [15:11:53] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/882656 (owner: Hashar) [15:13:02] PROBLEM - MariaDB Replica Lag: s7 #page on db1170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 892.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:13:31] (CR) Hashar: "PCC https://puppet-compiler.wmflabs.org/output/882656/1584/ and the diff is https://puppet-compiler.wmflabs.org/output/882656/1584/pcc-db1" [puppet] - https://gerrit.wikimedia.org/r/882656 (owner: Hashar) [15:14:02] _joe_: Is there an incident doc? 
Is it necessary to create one for this? [15:14:39] brett: probably not need to [15:14:48] Was it just two machines that didn't have the columns? [15:14:55] looks so for now yes [15:15:02] I'm running linter drift report to see generally what could be wrong [15:15:11] db1170 should be fixed now [15:15:29] Can we deploy this if it was just 2 machines? [15:15:35] (PS1) BBlack: Possibly mitigate ATS bug with semicolon in Path [puppet] - https://gerrit.wikimedia.org/r/882663 (https://phabricator.wikimedia.org/T238285) [15:15:41] sbailey: no, let's make sure it was just those two [15:15:41] it'll take a bit of time [15:15:47] ok [15:16:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 5%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43250 and previous config saved to /var/cache/conftool/dbconfig/20230123-151611-root.json [15:16:16] RECOVERY - MariaDB Replica Lag: s7 #page on db1170 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:16:22] RECOVERY - MariaDB Replica SQL: s7 #page on db1170 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:16:40] (PS2) BBlack: Possibly mitigate ATS bug with semicolon in Path [puppet] - https://gerrit.wikimedia.org/r/882663 (https://phabricator.wikimedia.org/T238285) [15:16:42] I am now repooling both hosts [15:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 5%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43251 and previous config saved to /var/cache/conftool/dbconfig/20230123-151642-root.json [15:17:20] !log reprepro -C main include bullseye-wikimedia trafficserver_9.1.4-1wm1_amd64.changes: T325563 [15:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:24] T325563: Package and deploy ATS 9.1.4 - 
https://phabricator.wikimedia.org/T325563 [15:17:42] (PS1) Bking: wdqs: mount NFS to new hosts [puppet] - https://gerrit.wikimedia.org/r/882664 (https://phabricator.wikimedia.org/T323096) [15:19:39] sbailey: FWIW, I'm seeing drift on linter_params in every wiki: [15:19:44] https://www.irccloud.com/pastebin/IsQT61W4/ [15:20:17] this means the field is nullable in code but not production or other way around [15:21:11] Ah, hmm. How can beta be ok but others not? [15:21:16] (CR) CI reject: [V: -1] Release 6.0.11-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: Ssingh) [15:21:20] (CR) JMeybohm: "Could you be more explicit and allow access to those ports by only the spawned job pods?" [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [15:22:02] Amir1, can we chat on slack offline so I can fix/understand how this might happen? [15:22:34] sure [15:23:16] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: Dzahn) [15:26:20] (CR) Ssingh: "Updated typo, ignoring the build failure as expected." 
[debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: Ssingh) [15:28:31] (CR) DCausse: "seems like wdqs1010 is missing from ferm" [puppet] - https://gerrit.wikimedia.org/r/882664 (https://phabricator.wikimedia.org/T323096) (owner: Bking) [15:31:00] (PS2) Bking: wdqs: mount NFS to new hosts [puppet] - https://gerrit.wikimedia.org/r/882664 (https://phabricator.wikimedia.org/T323096) [15:31:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 10%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43252 and previous config saved to /var/cache/conftool/dbconfig/20230123-153116-root.json [15:31:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 10%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43253 and previous config saved to /var/cache/conftool/dbconfig/20230123-153147-root.json [15:32:00] (CR) Bking: wdqs: mount NFS to new hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/882664 (https://phabricator.wikimedia.org/T323096) (owner: Bking) [15:32:53] (CR) Ssingh: [V: +2 C: +2] Release 6.0.11-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: Ssingh) [15:34:42] (CR) DCausse: [C: +1] wdqs: mount NFS to new hosts [puppet] - https://gerrit.wikimedia.org/r/882664 (https://phabricator.wikimedia.org/T323096) (owner: Bking) [15:35:54] (CR) Bking: wdqs: mount NFS to new hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/882664 (https://phabricator.wikimedia.org/T323096) (owner: Bking) [15:37:30] (PS1) Vgutierrez: Stop parsing semi-colon as a URL path delimiter [debs/trafficserver] - https://gerrit.wikimedia.org/r/882667 [15:37:34] (PS1) Elukey: changeprop: fix liftwing's body settings [deployment-charts] - https://gerrit.wikimedia.org/r/882668 
(https://phabricator.wikimedia.org/T327302) [15:40:29] marostegui: is it ok if i ship a sec patch now? or should i wait a bit for the DB fixes to be finished? [15:40:57] (CR) Hnowlan: [C: +1] "LGTM based on the example configs used by changeprop!" [deployment-charts] - https://gerrit.wikimedia.org/r/882668 (https://phabricator.wikimedia.org/T327302) (owner: Elukey) [15:41:16] urbanecm: it should be fine [15:41:24] (CR) Bking: [C: +2] wdqs: mount NFS to new hosts [puppet] - https://gerrit.wikimedia.org/r/882664 (https://phabricator.wikimedia.org/T323096) (owner: Bking) [15:41:26] thank you, proceeding. [15:44:25] (CR) Ottomata: [C: +2] Add to admin_ng/README.md on how to deploy limiting the release [deployment-charts] - https://gerrit.wikimedia.org/r/882649 (owner: Ottomata) [15:44:36] !log on going maintenance on fasw-codfw [15:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:01] (CR) Elukey: [C: +2] changeprop: fix liftwing's body settings [deployment-charts] - https://gerrit.wikimedia.org/r/882668 (https://phabricator.wikimedia.org/T327302) (owner: Elukey) [15:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 25%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43254 and previous config saved to /var/cache/conftool/dbconfig/20230123-154621-root.json [15:46:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 25%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43255 and previous config saved to /var/cache/conftool/dbconfig/20230123-154652-root.json [15:48:38] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [15:48:49] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [15:49:23] (Merged) jenkins-bot: Add to admin_ng/README.md on how to deploy limiting the release [deployment-charts] - 
https://gerrit.wikimedia.org/r/882649 (owner: Ottomata) [15:50:30] (PS2) Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) [15:50:32] !log Deploy security patch for T327613 [15:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:09] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal, AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:25] (CR) CI reject: [V: -1] flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [15:51:27] (PS3) Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) [15:51:33] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 34, down: 10, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:52:11] (CR) CI reject: [V: -1] flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [15:52:30] (PS4) Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) [15:53:51] !log reprepro -C main include bullseye-wikimedia varnish_6.0.11-1wm1_amd64.changes: T326634 [15:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:54] T326634: Package and deploy varnish 
6.0.11 - https://phabricator.wikimedia.org/T326634 [15:54:11] SRE, Traffic, Patch-For-Review: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 (ssingh) [15:59:15] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:34] the secpatch's deployment is done [15:59:38] (CR) Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [15:59:43] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:01:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 50%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43256 and previous config saved to /var/cache/conftool/dbconfig/20230123-160126-root.json [16:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 50%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43257 and previous config saved to /var/cache/conftool/dbconfig/20230123-160157-root.json [16:04:20] (CR) Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [16:08:00] SRE, Traffic, Traffic-Icebox, WMF-General-or-Unknown, and 3 others: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (Pigsonthewing) T261624 was merged here; in that ticket I asked: > On testing, I can see t... 
[16:11:45] (CR) Jbond: "@alex, i think ill take over this CR unless there are objections" [puppet] - https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: Alexandros Kosiaris) [16:12:16] (CR) Jbond: Fix xihua's account (1 comment) [puppet] - https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: Alexandros Kosiaris) [16:15:28] (PS5) Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) [16:16:00] SRE, ops-esams, DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (RobH) [16:16:02] SRE, ops-esams, DC-Ops, Infrastructure-Foundations, decommission-hardware: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (RobH) Open→Declined device resurrected itself, decom task declined as its now reporting into ripe portal [16:16:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 75%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43258 and previous config saved to /var/cache/conftool/dbconfig/20230123-161633-root.json [16:16:39] (CR) Ottomata: flink-app - explicitly set Flink ports and configure ingress netpol for them (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [16:17:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 75%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43259 and previous config saved to /var/cache/conftool/dbconfig/20230123-161702-root.json [16:21:39] (PS1) Ottomata: flink - avoid adding an extra 'k8s_api_enabled' label by using component label instead [deployment-charts] - https://gerrit.wikimedia.org/r/882680 (https://phabricator.wikimedia.org/T324576) 
[16:24:52] (PS1) Stang: newiki: Add new permissions to group reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/882681 (https://phabricator.wikimedia.org/T327114) [16:25:22] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (Papaul) [16:25:30] (CR) BCornwall: [V: +1 C: +2] tlsproxy: Remove nginx_tune_for_media [puppet] - https://gerrit.wikimedia.org/r/881902 (https://phabricator.wikimedia.org/T228730) (owner: BCornwall) [16:25:36] (PS2) BCornwall: tlsproxy: Remove nginx_tune_for_media [puppet] - https://gerrit.wikimedia.org/r/881902 (https://phabricator.wikimedia.org/T228730) [16:26:19] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/882682 (https://phabricator.wikimedia.org/T128546) [16:27:04] (CR) BCornwall: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39208/console" [puppet] - https://gerrit.wikimedia.org/r/881902 (https://phabricator.wikimedia.org/T228730) (owner: BCornwall) [16:29:36] (CR) Ottomata: [C: +2] flink-app - explicitly set Flink ports and configure ingress netpol for them (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [16:29:44] (CR) Ottomata: [C: +2] flink - avoid adding an extra 'k8s_api_enabled' label by using component label instead [deployment-charts] - https://gerrit.wikimedia.org/r/882680 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [16:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T1630). 
nyaa~ [16:30:39] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/882682 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [16:31:23] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/882682 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [16:31:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 100%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43260 and previous config saved to /var/cache/conftool/dbconfig/20230123-163138-root.json [16:32:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 100%: After adding a column', diff saved to https://phabricator.wikimedia.org/P43261 and previous config saved to /var/cache/conftool/dbconfig/20230123-163207-root.json [16:34:28] (Merged) jenkins-bot: flink-app - explicitly set Flink ports and configure ingress netpol for them [deployment-charts] - https://gerrit.wikimedia.org/r/882662 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [16:34:31] (Merged) jenkins-bot: flink - avoid adding an extra 'k8s_api_enabled' label by using component label instead [deployment-charts] - https://gerrit.wikimedia.org/r/882680 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [16:35:06] (PS4) Jbond: puppet_compiler: serve pson.gz as application/json [puppet] - https://gerrit.wikimedia.org/r/882656 (owner: Hashar) [16:35:07] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:35:09] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:35:55] (CR) Jbond: [C: +2] "updated slightly, thanks" [puppet] - https://gerrit.wikimedia.org/r/882656 (owner: Hashar) [16:36:12] (CR) Jbond: [V: +2 C: +2] puppet_compiler: serve pson.gz as application/json [puppet] - https://gerrit.wikimedia.org/r/882656 (owner: Hashar) [16:39:28] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (Ottomata) Approved by me. I think we need someone at WMF to approve/sponsor @taavi's membership in this group though. @taavi, could someone maybe in Cloud VPS do this for you? [16:40:00] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:40:04] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:41:12] (CR) Hashar: "I can confirm it makes Firefox pretty print the pson.gz ;) Thank you!" [puppet] - https://gerrit.wikimedia.org/r/882656 (owner: Hashar) [16:41:23] (CR) Jbond: [C: +2] admin: Add check for duplicate uid's (1 comment) [puppet] - https://gerrit.wikimedia.org/r/882652 (owner: Jbond) [16:41:39] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [16:41:43] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [16:41:55] (CR) Jbond: [C: +2] Fix xihua's account [puppet] - https://gerrit.wikimedia.org/r/881872 (https://phabricator.wikimedia.org/T325004) (owner: Alexandros Kosiaris) [16:42:02] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:882682| Bumping portals to master (T128546)]] (duration: 06m 48s) [16:42:05] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:48:51] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:882682| Bumping portals to master (T128546)]] (duration: 06m 48s) 
[16:48:55] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:50:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:50:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:53:54] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (jbond) Open→Resolved I have gone ahead and merged the changes to rename this account, please reopen if you have have... [16:56:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:56:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:58:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:58:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:59:41] SRE, Traffic, Data Pipelines (Sprint 07): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (Snwachukwu) #traffic Can you please confirm that there were cases of pages served in ##eqsin## but not reported in ##webrequest logs##. 
[17:02:26] (PS1) Ottomata: flink-app - netpol must use app: - podSelector [deployment-charts] - https://gerrit.wikimedia.org/r/882692 [17:05:06] (PS2) Ottomata: flink-app - netpol must use app: - podSelector [deployment-charts] - https://gerrit.wikimedia.org/r/882692 [17:05:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance [17:05:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance [17:07:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2114 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P43262 and previous config saved to /var/cache/conftool/dbconfig/20230123-170726-ladsgroup.json [17:22:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2114 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P43263 and previous config saved to /var/cache/conftool/dbconfig/20230123-172231-ladsgroup.json [17:37:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2114 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P43264 and previous config saved to /var/cache/conftool/dbconfig/20230123-173736-ladsgroup.json [17:39:26] (CR) Herron: [C: +1] conftool-data: add logstash[12]032 to kibana7 backend [puppet] - https://gerrit.wikimedia.org/r/881813 (owner: Cwhite) [17:41:48] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (jhathaway) Happy to sponsor @taavi for this request [17:44:02] (CR) Dzahn: [C: +2] idp: remove racktables related settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn) [17:44:10] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 
(Clement_Goubert) [17:49:29] (PS3) Dzahn: idp: remove config for racktables [puppet] - https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) [17:49:31] (PS1) Clément Goubert: admin: Grant taavi access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/882696 (https://phabricator.wikimedia.org/T327013) [17:50:22] (CR) Dzahn: [C: +1] "lgtm, has approval from ottomata and another SRE as sponsor" [puppet] - https://gerrit.wikimedia.org/r/882696 (https://phabricator.wikimedia.org/T327013) (owner: Clément Goubert) [17:50:43] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (Clement_Goubert) @taavi Patch ready, assuming you don't need kerberos access. Here are the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#U... [17:50:46] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is CRITICAL: SSL CRITICAL - Certificate sessionstore1001-a valid until 2023-02-22 11:12:05 +0000 (expires in 29 days) eevans See: https://phabricator.wikimedia.org/T327675 - The acknowledgement expires at: 2023-01-30 17:50:07. https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:50:46] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.32.85:7001 on sessionstore1002 is CRITICAL: SSL CRITICAL - Certificate sessionstore1002-a valid until 2023-02-22 11:12:08 +0000 (expires in 29 days) eevans See: https://phabricator.wikimedia.org/T327675 - The acknowledgement expires at: 2023-01-30 17:50:07. 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:50:46] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is CRITICAL: SSL CRITICAL - Certificate sessionstore1003-a valid until 2023-02-22 11:12:10 +0000 (expires in 29 days) eevans See: https://phabricator.wikimedia.org/T327675 - The acknowledgement expires at: 2023-01-30 17:50:07. https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:50:46] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is CRITICAL: SSL CRITICAL - Certificate sessionstore2001-a valid until 2023-02-22 11:12:13 +0000 (expires in 29 days) eevans See: https://phabricator.wikimedia.org/T327675 - The acknowledgement expires at: 2023-01-30 17:50:07. https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:50:46] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.32.101:7001 on sessionstore2002 is CRITICAL: SSL CRITICAL - Certificate sessionstore2002-a valid until 2023-02-22 11:12:16 +0000 (expires in 29 days) eevans See: https://phabricator.wikimedia.org/T327675 - The acknowledgement expires at: 2023-01-30 17:50:07. https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:50:46] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is CRITICAL: SSL CRITICAL - Certificate sessionstore2003-a valid until 2023-02-22 11:12:18 +0000 (expires in 29 days) eevans See: https://phabricator.wikimedia.org/T327675 - The acknowledgement expires at: 2023-01-30 17:50:07. 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:51:15] (CR) Clément Goubert: [C: +2] admin: Grant taavi access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/882696 (https://phabricator.wikimedia.org/T327013) (owner: Clément Goubert) [17:52:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2114 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P43265 and previous config saved to /var/cache/conftool/dbconfig/20230123-175241-ladsgroup.json [17:53:57] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Taavi - https://phabricator.wikimedia.org/T327013 (Clement_Goubert) In progress→Resolved @taavi Access request merged, you should have your access around 30 minutes from now when puppet has run. R... [17:56:15] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/output/881938/39209/" [puppet] - https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn) [17:56:42] (CR) Dzahn: [C: +2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/881938" [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn) [17:56:54] (PS3) Hnowlan: thumbor: add and use haproxy healthz lvs check [puppet] - https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) [17:57:00] (PS2) Hnowlan: thumbor: add failure condition to health check [deployment-charts] - https://gerrit.wikimedia.org/r/881635 (https://phabricator.wikimedia.org/T233196) [17:57:37] (CR) Dzahn: [C: +2] "Notice: /Stage[main]/Apereo_cas/File[/etc/cas/services/racktables-18.json]/ensure: removed" [puppet] - https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn) [17:58:43] (CR) Hnowlan: [V: +1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39210/console" [puppet] - https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [17:59:58] (CR) Dzahn: [C: +2] "IDP config was removed on both idp servers and Apache config was removed on miscweb, no problem when refreshing apache" [puppet] - https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn) [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T1800) [18:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T1800). [18:00:36] (PS5) Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [18:00:39] (CR) Dzahn: [C: +2] "spoke too soon :) apache2.service (apache2-apache2-after-network-online-target)]: Skipping because of failed dependencies" [puppet] - https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn) [18:00:43] (CR) CI reject: [V: -1] flink-operator: bump version to 1.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: Bking) [18:02:23] PROBLEM - Check systemd state on miscweb2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:35] (CR) Hnowlan: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39212/console" [puppet] - https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [18:02:44] !log dzahn@cumin1001 START - Cookbook 
sre.hosts.downtime for 3:00:00 on miscweb2002.codfw.wmnet with reason: debugging on iactive server [18:02:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on miscweb2002.codfw.wmnet with reason: debugging on iactive server [18:03:13] (CR) JHathaway: rspamd: vendor github.com/oxc/puppet-rspamd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: JHathaway) [18:03:17] (CR) JHathaway: [C: +2] rspamd: vendor github.com/oxc/puppet-rspamd [puppet] - https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: JHathaway) [18:04:06] ACKNOWLEDGEMENT - Check systemd state on miscweb2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service daniel_zahn inactive server, debugging in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:06] ACKNOWLEDGEMENT - Static CodeReview archive HTTP on miscweb2002 is CRITICAL: connect to address 10.192.16.211 and port 80: Connection refused daniel_zahn inactive server, debugging in progress https://wikitech.wikimedia.org/wiki/Static-codereview.wikimedia.org [18:04:06] ACKNOWLEDGEMENT - racktables.wikimedia.org requires authentication on miscweb2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable daniel_zahn inactive server, debugging in progress https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:05:05] !log miscweb1002 - disabling puppet because latest merge would break apache if it runs, debugging in progress on inactive miscweb2002 [18:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:08:16] !log miscweb2002 - unlink /etc/apache2/mods-enabled/auth_cas.conf - unlink /etc/apache2/mods-enabled/auth_cas.load [18:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:49] RECOVERY - Check systemd state on miscweb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:33] (03PS6) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [18:13:46] (03CR) 10CI reject: [V: 04-1] flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [18:14:29] (03CR) 10Dzahn: [C: 03+2] "it still broke because this way puppet did not unload the CAS apache module. so technically should be "ensure absent" instead of just remo" [puppet] - 10https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:18:17] (03CR) 10Dzahn: [C: 03+2] "profile::idp::client::httpd would need to first get an "$ensure" class parameter that absents the mod_conf and the libapache2-mod-auth-cas" [puppet] - 10https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:19:38] !log miscweb2002 - unlink /etc/apache2/mods-enabled/auth_cas.conf - unlink /etc/apache2/mods-enabled/auth_cas.load - apt-get remove libapache2-mod-auth-cas - T327405 [18:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:42] T327405: Decommission Racktables - https://phabricator.wikimedia.org/T327405 [18:22:37] (03PS7) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [18:23:03] (03CR) 10CI reject: [V: 04-1] flink-operator: bump version to 
1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [18:28:26] (03CR) 10Ssingh: "Sorry, I skipped reviewing this for quite a while. Are we still planning on merging these or are we doing a top-level declaration instead?" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:30:20] (03PS8) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [18:30:59] 10SRE: profile::idp::client::httpd should be absent-able - https://phabricator.wikimedia.org/T327678 (10Dzahn) [18:31:05] (03CR) 10CI reject: [V: 04-1] flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [18:31:12] (03PS1) 10Jelto: gitlab: exclude shell scripts and other backups from rsync jobs [puppet] - 10https://gerrit.wikimedia.org/r/882704 (https://phabricator.wikimedia.org/T274463) [18:32:13] 10SRE, 10Infrastructure-Foundations: profile::idp::client::httpd should be absent-able - https://phabricator.wikimedia.org/T327678 (10Dzahn) [18:33:31] (03PS9) 10Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) [18:38:32] (03PS1) 10Jforrester: Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882705 [18:38:34] (03PS1) 10Jforrester: Replace wgBetaFeaturesWhitelist with wgBetaFeaturesAllowList, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882706 [18:42:39] (03CR) 10Dzahn: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/882704 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [18:48:27] !log miscweb1002 - unload CAS apache module and config; apt-get remove 
libapache2-mod-auth-cas [18:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:50:15] (03CR) 10Dzahn: [C: 03+2] "unloaded the module and removed the package manually on miscweb*, which are fine now. also did a follow-up ticket but not sure how importa" [puppet] - 10https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [18:50:54] (03PS2) 10Dzahn: miscweb: remove racktables profile from miscweb role [puppet] - 10https://gerrit.wikimedia.org/r/881694 (https://phabricator.wikimedia.org/T327405) [18:51:09] 10SRE-swift-storage, 10Wikimedia-production-error: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T327681 (10TheresNoTime) [18:51:25] (03CR) 10Ottomata: [C: 03+2] flink-app - netpol must use app: - podSelector [deployment-charts] - 10https://gerrit.wikimedia.org/r/882692 (owner: 10Ottomata) [18:53:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:55:24] Just surfacing that T327681 is intermittently causing user facing exceptions — fairly low rate, not consistently repeatable [18:55:24] T327681: FileBackendError: Iterator page I/O error. 
- https://phabricator.wikimedia.org/T327681 [18:57:33] (03Merged) 10jenkins-bot: flink-app - netpol must use app: - podSelector [deployment-charts] - 10https://gerrit.wikimedia.org/r/882692 (owner: 10Ottomata) [19:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:11:05] 10SRE, 10Infrastructure-Foundations: profile::idp::client::httpd should be absent-able - https://phabricator.wikimedia.org/T327678 (10Dzahn) p:05Triage→03Low [19:12:14] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/881694/39213/" [puppet] - 10https://gerrit.wikimedia.org/r/881694 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [19:16:18] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet [19:16:31] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase1016.eqiad.wmnet [19:17:53] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet [19:17:55] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase1016.eqiad.wmnet [19:18:17] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet [19:19:00] (03CR) 10Dzahn: "we should not forget there is also this include: modules/profile/manifests/mariadb/grants/production.pp: include passwords::racktables " [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [19:24:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1016.eqiad.wmnet [19:30:27] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1019.eqiad.wmnet [19:36:24] (03PS1) 10Jdrewniak: Enable Page 
Tools for logged-in users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882715 (https://phabricator.wikimedia.org/T327686) [19:37:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1019.eqiad.wmnet [19:41:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1020.eqiad.wmnet [19:48:57] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1020.eqiad.wmnet [19:49:30] 10SRE, 10Domains, 10Traffic-Icebox: Redirecting incoming queries to non-existent subpages (due to Godaddy behavior on some external WikiJournal sites) - https://phabricator.wikimedia.org/T212914 (10BCornwall) 05Open→03Resolved a:03BCornwall It looks like they've managed to escape the talons of godaddy... [19:58:51] (03PS1) 10BCornwall: varnish: Reword misc-frontend vcl_switch comment [puppet] - 10https://gerrit.wikimedia.org/r/882716 (https://phabricator.wikimedia.org/T205988) [19:59:13] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Simplify comment misc-frontend.inc.vcl.erb - https://phabricator.wikimedia.org/T205988 (10BCornwall) 05Open→03In progress a:03BCornwall [19:59:25] 10SRE: Expired puppet certificates - https://phabricator.wikimedia.org/T260110 (10Aklapper) [19:59:37] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Simplify comment misc-frontend.inc.vcl.erb - https://phabricator.wikimedia.org/T205988 (10BCornwall) Since this ticket is relevant to the comment itself, let's just fix that and follow-up with another, more detailed description of what needs refactoring. [20:01:13] 10SRE, 10CheckUser, 10Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (10Urbanecm) [20:05:58] (03PS2) 10Krinkle: Use core's PoolCounterClient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881466 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [20:07:15] (03CR) 10Krinkle: [C: 03+1] "LGTM. 
This needs careful testing on mwdebug with PC hits and misses, e.g. browse old and current revisions on various articles and confirm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881466 (https://phabricator.wikimedia.org/T327336) (owner: 10Zabe) [20:10:28] 10SRE, 10Traffic-Icebox: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521 (10BCornwall) 05Open→03Invalid It's sad that no action was taken in the years since the report has been opened, but it appears that @tgr is correct and it's ready to be... [20:19:24] (03CR) 10Dzahn: admin/canary_appserver: add group of users allowed to disable puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [20:23:19] 10SRE, 10Traffic-Icebox: Unwanted service startups and their triggers - https://phabricator.wikimedia.org/T191017 (10BCornwall) 05Open→03Resolved a:03BCornwall `systemctl mask` achieves what is desired here and has been successfully implemented with varnishncsa.service and varnishlog.service (see `Change... 
[20:26:14] (03PS1) 10Andrea Denisse: centrallog2002: Apply partman standard software raid recipe [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T313858) [20:31:46] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39214/console" [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T313858) (owner: 10Andrea Denisse) [20:45:16] !log restart T315510 on group1 after mwmaint restart, currently running on wikidatawiki [20:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:20] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [20:45:40] (03CR) 10Thcipriani: [C: 03+1] admin/canary_appserver: add group of users allowed to disable puppet [puppet] - 10https://gerrit.wikimedia.org/r/879147 (https://phabricator.wikimedia.org/T305979) (owner: 10Dzahn) [20:45:54] (03CR) 10Andrea Denisse: "PCC results: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39214/console" [puppet] - 10https://gerrit.wikimedia.org/r/882718 (https://phabricator.wikimedia.org/T313858) (owner: 10Andrea Denisse) [20:56:10] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [20:56:14] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [20:58:58] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) I also am not sure of how to find out consumers of the HTTP-only services, but I've created a WIP patch that at least lists the candidates. [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T2100). [21:00:05] jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:19] I can deploy. [21:00:56] kindrobot: ok thanks [21:01:54] !log start UTC late backport window [21:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882715 (https://phabricator.wikimedia.org/T327686) (owner: 10Jdrewniak) [21:02:57] (03Merged) 10jenkins-bot: Enable Page Tools for logged-in users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882715 (https://phabricator.wikimedia.org/T327686) (owner: 10Jdrewniak) [21:03:10] (thanks kindrobot, I'm finally back in the "right timezone" so should be able to pick up more again!) [21:03:11] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:882715|Enable Page Tools for logged-in users on enwiki (T327686)]] [21:03:15] T327686: Deploy page tools for logged-in users on English Wikipedia - https://phabricator.wikimedia.org/T327686 [21:04:35] My pleasure TheresNoTime! The only days I am free for this window are Monday and Wednesday, so I try to pick up one of those a week if I can. [21:04:54] !log kindrobot@deploy1002 jdrewniak and kindrobot: Backport for [[gerrit:882715|Enable Page Tools for logged-in users on enwiki (T327686)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:05:04] jan_drewniak: can you confirm? [21:05:58] kindrobot: yup looks good [21:06:08] Great, syncing. 
[21:09:29] (03PS1) 10Andrea Denisse: centrallog1002: Add to eqiad anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/882724 (https://phabricator.wikimedia.org/T318778) [21:10:09] (03PS2) 10Bking: flink-kubernetes-operator: bump version to 1.3.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576) [21:11:45] (03CR) 10Ottomata: flink-kubernetes-operator: bump version to 1.3.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576) (owner: 10Bking) [21:12:12] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:882715|Enable Page Tools for logged-in users on enwiki (T327686)]] (duration: 09m 00s) [21:12:16] T327686: Deploy page tools for logged-in users on English Wikipedia - https://phabricator.wikimedia.org/T327686 [21:12:43] !log close UTC late backport window [21:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:11] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1021.eqiad.wmnet [21:23:36] (03PS1) 10Zabe: throttle: Remove expired rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882746 [21:26:13] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10BCornwall) I've brought the issue up with langcom on their [[ https://meta.wikimedia.org/wiki/Talk:Language_c... 
[21:26:23] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10BCornwall) 05Open→03In progress [21:29:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1021.eqiad.wmnet [21:31:25] (03CR) 10Zabe: [C: 03+2] throttle: Remove expired rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882746 (owner: 10Zabe) [21:31:47] (03PS1) 10Andrea Denisse: centrallog: Add centrallog1002 as Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/882747 (https://phabricator.wikimedia.org/T318778) [21:32:09] (03Merged) 10jenkins-bot: throttle: Remove expired rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/882746 (owner: 10Zabe) [21:32:53] !log zabe@deploy1002 Started scap: Backport for [[gerrit:882746|throttle: Remove expired rule]] [21:34:35] !log zabe@deploy1002 zabe: Backport for [[gerrit:882746|throttle: Remove expired rule]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:35:26] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1028.eqiad.wmnet [21:36:11] (03PS1) 10Nray: Work around sticky-positioned layers disabling subpixel rendering [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/882727 (https://phabricator.wikimedia.org/T327460) [21:40:30] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) 05Open→03In progress bblack has some ideas: ` 13:27 we don't have a 100% reliable spot-check to know for sure 13:27 but yeah, we can guestim... 
[21:40:39] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:882746|throttle: Remove expired rule]] (duration: 07m 45s) [21:42:51] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1028.eqiad.wmnet [21:49:25] (03CR) 10Cwhite: [C: 03+2] logstash: enable filters for ecs 1.11.0 [puppet] - 10https://gerrit.wikimedia.org/r/881812 (https://phabricator.wikimedia.org/T326794) (owner: 10Cwhite) [22:00:04] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230123T2200). [22:07:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:08:35] Hey all - had a couple of security patches we were going to try to deploy today: T285159, T296593 [22:14:18] (ProbeDown) firing: (2) Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:14:49] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe1003.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers thanos-fe1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:15:15] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [22:19:18] (ProbeDown) resolved: (2) Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:22:01] (03PS1) 10Bking: dse-k8s: add rdf-streaming-update-ng namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) [22:24:40] (03CR) 10Cwhite: [C: 04-1] mediawiki: Update ecs logging to 1.11.0 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert) [22:25:54] (03CR) 10Cwhite: [C: 03+2] Clarify ecs.version field format in docs [software/ecs] - 10https://gerrit.wikimedia.org/r/881809 (https://phabricator.wikimedia.org/T292585) (owner: 10Cwhite) [22:26:54] (03CR) 10Cwhite: [C: 03+2] add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite) [22:27:25] (03Merged) 10jenkins-bot: add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite) [22:27:47] (03CR) 10Cwhite: [C: 03+2] Clarify ecs.version field format in docs [software/ecs] - 10https://gerrit.wikimedia.org/r/881809 (https://phabricator.wikimedia.org/T292585) (owner: 10Cwhite) [22:28:15] (03Merged) 10jenkins-bot: Clarify ecs.version field format in docs [software/ecs] - 10https://gerrit.wikimedia.org/r/881809 (https://phabricator.wikimedia.org/T292585) (owner: 10Cwhite) [22:28:46] (03CR) 10Cwhite: [C: 03+2] role: remove kibana7_ecs role [puppet] - 10https://gerrit.wikimedia.org/r/879888 (owner: 10Cwhite) [22:31:58] !log Deployed patch for T285159 [22:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:47] (03CR) 10Cwhite: [C: 03+1] "Tested upgrade and initial install on beta. 
Works great!" [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar) [22:37:13] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1017.eqiad.wmnet [22:45:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1017.eqiad.wmnet [22:46:24] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.119`. Pre-deploy tests passing on canary `wdqs1003` [22:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:59] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@544f5f3]: 0.3.119 [22:49:52] !log [WDQS Deploy] Tests passing following deploy of `0.3.119` on canary `wdqs1003`; proceeding to rest of fleet [22:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:42] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1022.eqiad.wmnet [22:56:29] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@544f5f3]: 0.3.119 (duration: 07m 30s) [22:57:43] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [22:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:48] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [22:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:55] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [22:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
restbase1022.eqiad.wmnet [23:07:41] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1023.eqiad.wmnet [23:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:11:33] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:51] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1023.eqiad.wmnet [23:17:12] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1024.eqiad.wmnet [23:24:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1024.eqiad.wmnet [23:24:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1029.eqiad.wmnet [23:31:35] (03CR) 10Cwhite: [C: 04-2] "Blocking until we can work out a path forward." [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [23:31:43] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1029.eqiad.wmnet [23:51:55] (03PS2) 10Cwhite: logstash: Add PTR resolution to firewall logs [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: 10Ayounsi) [23:57:59] (03CR) 10Cwhite: logstash: Add PTR resolution to firewall logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: 10Ayounsi)