[00:00:05] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T0000). Please do the needful.
[00:05:48] Krinkle James_F Around to review https://gerrit.wikimedia.org/r/c/mediawiki/core/+/697887, this would unblock the deployment I have right now :D
[00:06:29] Amir1: Oh, right, yeah, that'd help wouldn't it?
[00:06:47] :D
[00:06:59] At least I confirmed that this fixed the error
[00:07:14] Thanks <3
[00:07:54] (PS1) Ladsgroup: user: Accept options-messages for multiselect user options [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697818 (https://phabricator.wikimedia.org/T58633)
[00:07:59] Tsk for hot-deploying unpushed patches even if just to debug. :-)
[00:08:02] (CR) Ladsgroup: [C: +2] user: Accept options-messages for multiselect user options [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697818 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[00:09:09] James_F: isn't it possible? I thought it's okay to do that from time to time on mwdebug
[00:09:21] to note, I didn't push it everywhere yet
[00:10:03] Amir1: It's fine. I've done it a few times. :-)
[00:10:13] First rule of deployments: Fix production.
[00:10:36] isn't it "Don't talk about deployments"?
[00:11:59] Thanks for it though
[00:12:10] I hope this improves the preferences
[00:12:11] * James_F grins.
[00:13:01] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Jdforrester-WMF)
[00:17:57] the master is failing on merge but it seems it's just selenium being flaky
[00:18:12] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[00:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:23] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:29] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1007.eqiad.wmnet --dest wdqs1003.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[00:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:33] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:23:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:54] (Merged) jenkins-bot: user: Accept options-messages for multiselect user options [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/697818 (https://phabricator.wikimedia.org/T58633) (owner: Ladsgroup)
[00:35:00] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:14] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1007.eqiad.wmnet --dest wdqs1003.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[00:35:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:18] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:36:35] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.7/includes/user/UserOptionsManager.php: Backport: [[gerrit:697818|user: Accept options-messages for multiselect user options (T58633 T278650)]] (duration: 00m 57s)
[00:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:40] T58633: Preference retrieval should not require so much parsing - https://phabricator.wikimedia.org/T58633
[00:36:40] T278650: Loading or saving preferences is taking too long on Commons - https://phabricator.wikimedia.org/T278650
[00:40:00] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/Gadgets: Backport: [[gerrit:697816|Reduce message parse in GadgetHooks::getPreferences (second time) (T58633 T278650)]], Try II (duration: 00m 57s)
[00:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:10] James_F: https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?viewPanel=19&orgId=1&refresh=5m&var-metric=p50&var-module=options&from=now-3h&to=now
[00:45:29] Amir1: Just a tiny improvement there. :-)
[00:45:50] Amir1: Something for K.rinkle's performance report, I guess…
[00:46:00] Already pinged him :D
[00:46:06] Ha, cool.
[00:46:29] It also cut 99th percentile to a quarter but it's so noisy, it's hard to see
[01:00:11] put the details in https://phabricator.wikimedia.org/T278650#7130951
[01:25:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:56] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (RKemper) I know work is still ongoing but just wanted to say - thanks for all your work on this Papaul!
I know the server's in good hands with...
[01:43:16] !log [WDQS] Pooled `wdqs1004` (caught up on lag)
[01:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:42] SRE, Wikimedia-Mailing-lists: Mailman3 apparently disregarding nonmember addresses set to accept - https://phabricator.wikimedia.org/T284182 (Ladsgroup) So to recap the problem is that you set to hold all WMF emails but overrode it per case using non-members. I honestly think this is hacky, why not just...
[01:47:12] !log T280382 `wdqs2003.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.9T 998G 1.8T 36% /srv`
[01:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:16] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[01:50:20] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.132e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:51:15] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2006.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[01:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:45] SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (ssingh) Thanks for all the help and sorry it took a while! I think 10G should be fine for now given the current usage on the other Wikidough hosts.
[02:04:45] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[02:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:22] !log T280382 `wdqs1003.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.9T 998G 1.8T 36% /srv`
[02:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:25] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:06:28] PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 5307 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:07:12] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2006.codfw.wmnet with reason: REIMAGE
[02:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:40] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1008.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[02:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:12] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1003 is CRITICAL: 5121 ge 3600 Ryan Kemper just over 1 hour of lag, recovering https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:09:20] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2006.codfw.wmnet with reason: REIMAGE
[02:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:31] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1008.eqiad.wmnet with reason: REIMAGE
[02:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:34] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1008.eqiad.wmnet with reason: REIMAGE
[02:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:23] RECOVERY - WDQS high update lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1016 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:50:17] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:05:00] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: daily_account_consistency_check.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:30] (PS1) Marostegui: Revert "db1144: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/697820
[04:28:27] (CR) Marostegui: [C: +2] Revert "db1144: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/697820 (owner: Marostegui)
[04:28:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 25%: Repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P16255 and previous config saved to /var/cache/conftool/dbconfig/20210603-042851-root.json
[04:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:35] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[04:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:43] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1005.eqiad.wmnet --dest wdqs1008.eqiad.wmnet --reason "transferring fresh categories journal following reimage"
--blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[04:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:47] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[04:29:53] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[04:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:59] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2004.codfw.wmnet --dest wdqs2006.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[04:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:56] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:17] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[04:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:28] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[04:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:34] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2004.codfw.wmnet --dest wdqs2006.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[04:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:38] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[04:36:45] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[04:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:07] !log T280382 `sudo -i cookbook
sre.wdqs.data-transfer --source wdqs1005.eqiad.wmnet --dest wdqs1008.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[04:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:35] ops-codfw, DBA, Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (Marostegui) Not sure if this was maintenance or not, but this host rebooted again around 9h ago. ` root@db2100:~# uptime 04:40:17 up 9:28, 1 user, l...
[04:43:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 50%: Repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P16256 and previous config saved to /var/cache/conftool/dbconfig/20210603-044355-root.json
[04:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:41] ops-codfw, DBA, Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (Marostegui) Resolved→Open It happened again: ` hpiLO-> show record35 status=0 status_tag=COMMAND COMPLETED Thu Jun 3 04:47:53 2...
[04:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 75%: Repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P16257 and previous config saved to /var/cache/conftool/dbconfig/20210603-045859-root.json
[04:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 100%: Repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P16258 and previous config saved to /var/cache/conftool/dbconfig/20210603-051402-root.json
[05:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P16259 and previous config saved to /var/cache/conftool/dbconfig/20210603-051853-marostegui.json
[05:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:04] !log Deploy schema change on db1121, lag will appear on s4 (commonswiki) wiki replicas - T266486 T268392 T273360
[05:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:11] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392
[05:20:11] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360
[05:20:12] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486
[05:21:52] (PS1) Marostegui: db1121,db1155,clouddb*s4: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/697898 (https://phabricator.wikimedia.org/T266486)
[05:22:59] (CR) Marostegui: [C: +2] db1121,db1155,clouddb*s4: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/697898 (https://phabricator.wikimedia.org/T266486) (owner:
Marostegui)
[05:45:40] (Abandoned) Fomafix: Avoid redirects from HTTPS to HTTP and back to HTTPS [puppet] - https://gerrit.wikimedia.org/r/469262 (owner: Fomafix)
[06:05:23] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:34] PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 5149 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:07:19] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:58] PROBLEM - WDQS high update lag on wdqs2006 is CRITICAL: 5372 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:23:48] !log T280382 `wdqs1008.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 998G 1.5T 40% /srv`
[06:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:53] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[06:23:54] !log T280382 `wdqs2006.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred.
`df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 998G 1.5T 40% /srv`
[06:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:14] !log [WDQS] De-pooled `wdqs1008` and `wdqs2006` (~1 hour of lag to catch up on)
[06:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:57] (CR) Giuseppe Lavagetto: "The change look technically correct, but I think we need to take a look back at this and revisit it." [puppet] - https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) (owner: Jbond)
[06:56:26] (CR) Muehlenhoff: [C: +2] Create debmonitor user on buster with adduser [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/697734 (https://phabricator.wikimedia.org/T256098) (owner: Muehlenhoff)
[06:56:28] RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 913.9 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:59:36] RECOVERY - WDQS high update lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 1050 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:08:12] SRE, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (elukey) @JAnstee_WMF do you recall what password you used when creating the ssh key? It may be different from what you have saved, have you tried others? We canno...
[07:10:35] ops-codfw, DBA, Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (jcrespo) Hey, @Papaul, can you check this?
If the stick was bad (and so not recognized/enabled), I wouldn't expect it to reboot again. However, apparent...
[07:14:48] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:40] (PS1) Giuseppe Lavagetto: Add port to nutcracker pool [deployment-charts] - https://gerrit.wikimedia.org/r/697906
[07:26:07] (PS3) Fomafix: Add redirects for https://nan.wik{tionary,iquote,ibooks,isource}.org [puppet] - https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T86915)
[07:31:22] (CR) Ema: alertmanager: attach runbook/dashboard URLs to IRC messages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697721 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[07:47:45] (PS7) Fomafix: Add redirects from 'sgs' to 'bat-smg' [puppet] - https://gerrit.wikimedia.org/r/481540 (https://phabricator.wikimedia.org/T204830)
[07:48:13] (CR) Ema: alerts: reload prometheus instances after deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[07:48:17] (PS3) Fomafix: Add 'vro' as alias for 'fiu-vro' [puppet] - https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186)
[07:48:26] !log uploaded debmonitor-client 0.3.0-1+deb10u2 to apt.wikimedia.org
[07:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:55] (PS3) Fomafix: Add 'egl' as alias for 'eml' [puppet] - https://gerrit.wikimedia.org/r/527933 (https://phabricator.wikimedia.org/T36217)
[07:52:54] !log [WDQS] Pooled `wdqs1008` and `wdqs2006` (all caught up on lag)
[07:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:17] (PS4) Fomafix: Add 'cmn' as alias for 'zh' [puppet] - https://gerrit.wikimedia.org/r/528835
(https://phabricator.wikimedia.org/T23915)
[07:58:10] (CR) Volans: "indentation nit" (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/697734 (https://phabricator.wikimedia.org/T256098) (owner: Muehlenhoff)
[08:06:03] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Joe) >>! In T281423#7127666, @Legoktm wrote: > ` > legoktm@deploy1002:~$ curl https://staging.svc.eqiad.wmnet:4008/index.php > File not found. > legoktm@deploy10...
[08:07:06] (PS3) Fomafix: Add 'rup' as alias for 'roa-rup' [puppet] - https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988)
[08:09:57] !log upgrading esams/eqsin to debmonitor-client 0.3.0 (along with deleting/recreating system user within 100-499 range)
[08:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:04] (PS1) Jgiannelos: Use consistent binary name for tegola builds [software/tegola] (v0.14.x) - https://gerrit.wikimedia.org/r/697911
[08:11:59] (CR) Filippo Giunchedi: alerts: reload prometheus instances after deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:12:52] (CR) Jgiannelos: "Currently binary builds are names as `tegola_` which breaks the way blubber entrypoint is defined."
[software/tegola] (v0.14.x) - https://gerrit.wikimedia.org/r/697911 (owner: Jgiannelos)
[08:14:43] (PS4) Filippo Giunchedi: alertmanager: add a sample alert and test instructions [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806)
[08:14:45] (PS2) Filippo Giunchedi: alerts: reload prometheus instances after deploy [puppet] - https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806)
[08:16:44] (PS3) Fomafix: Add 'cbk' as alias for 'cbk-zam' [puppet] - https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657)
[08:19:35] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:33] (CR) Filippo Giunchedi: alertmanager: attach runbook/dashboard URLs to IRC messages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697721 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:20:37] (CR) Giuseppe Lavagetto: [C: +2] Add port to nutcracker pool [deployment-charts] - https://gerrit.wikimedia.org/r/697906 (owner: Giuseppe Lavagetto)
[08:23:15] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:04] (PS3) Fomafix: Add 'nrf' as alias for 'nrm' [puppet] - https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216)
[08:24:15] (PS2) Filippo Giunchedi: alertmanager: attach runbook/dashboard URLs to IRC messages [puppet] - https://gerrit.wikimedia.org/r/697721 (https://phabricator.wikimedia.org/T282806)
[08:24:17] (PS5) Filippo Giunchedi: alertmanager: add a sample alert and test instructions [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806)
[08:24:19] (PS3) Filippo Giunchedi: alerts: reload prometheus instances after deploy [puppet] - https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806)
[08:24:21] (CR) jerkins-bot: [V: -1] Add 'nrf' as alias for 'nrm' [puppet] - https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216) (owner: Fomafix)
[08:28:00] (PS3) Fomafix: Add 'bho' as alias for 'bh' [puppet] - https://gerrit.wikimedia.org/r/528782 (https://phabricator.wikimedia.org/T41968)
[08:28:54] (PS1) Effie Mouzeli: mwdebug: fix nutcracker pools [deployment-charts] - https://gerrit.wikimedia.org/r/697915
[08:29:05] (CR) jerkins-bot: [V: -1] mwdebug: fix nutcracker pools [deployment-charts] - https://gerrit.wikimedia.org/r/697915 (owner: Effie Mouzeli)
[08:30:39] (PS1) Filippo Giunchedi: New upstream release [debs/karma] - https://gerrit.wikimedia.org/r/697916
[08:31:08] (PS2) Fomafix: Add 'cmn' as alias for 'zh' [dns] - https://gerrit.wikimedia.org/r/528831 (https://phabricator.wikimedia.org/T23915)
[08:31:21] (PS1) Giuseppe Lavagetto: mwdebug: use the appropriate request for memory [deployment-charts] - https://gerrit.wikimedia.org/r/697917
[08:37:32] !log upgrading codfw to debmonitor-client 0.3.0 (along with deleting/recreating system user within 100-499 range) T235162
[08:37:35] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:36] T235162: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162
[08:38:15] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: use the appropriate request for memory [deployment-charts] - https://gerrit.wikimedia.org/r/697917 (owner: Giuseppe Lavagetto)
[08:40:24] (Merged) jenkins-bot: mwdebug: use the appropriate request for memory [deployment-charts] - https://gerrit.wikimedia.org/r/697917 (owner: Giuseppe Lavagetto)
[08:41:48] (PS11) Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[08:42:30] (CR) Ema: [C: +1] alerts: reload prometheus instances after deploy [puppet] - https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:42:40] ·debmonitor-client: error: argument -s/--server is required unless -n/--dry-run is set·
[08:43:01] (CR) Ema: [C: +1] alertmanager: add a sample alert and test instructions [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:43:52] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:43:53] (CR) Ema: [C: +1] alertmanager: attach runbook/dashboard URLs to IRC messages [puppet] - https://gerrit.wikimedia.org/r/697721 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:20] (PS2) Fomafix: Add 'nrf' as alias for 'nrm' [dns] - https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216)
[08:44:30] it is codfw only, so it matches your log, moritzm ?
[08:45:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:47:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:49:10] jynus: where do you see that?
[08:49:16] mail :-)
[08:49:39] "Output of systemd timer for '/usr/bin/debmonitor-client'" subject
[08:50:03] looking
[08:50:07] (CR) Arturo Borrero Gonzalez: [C: +1] ceph.mon: don't subscribe to ceph.conf [puppet] - https://gerrit.wikimedia.org/r/697715 (owner: David Caro)
[08:50:19] maybe it was temporary, I thought it was ongoing, but seems stopped now
[08:52:11] yeah, I think this is caused by the systemd timer triggering while I ran a debmonitor-client for A:codfw (since it needs to catch up with the temporary removal/reinstallation)
[08:55:27] !log uploading gitlab-ce 13.11.5-ce to apt.wikimedia.org thirdparty/gitlab
[08:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:21] (CR) Jcrespo: "Let's wait for feedback before merging. It doesn't make sense to merge now if all there it is now is a 0-byte file." [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[09:06:23] (CR) David Caro: [C: +2] ceph.mon: don't subscribe to ceph.conf [puppet] - https://gerrit.wikimedia.org/r/697715 (owner: David Caro)
[09:19:51] (PS1) Kormat: db-eqiad.php: Set pc1010 as pc2 primary.
[mediawiki-config] - https://gerrit.wikimedia.org/r/697921 (https://phabricator.wikimedia.org/T282761)
[09:25:18] SRE, Analytics, LDAP-Access-Requests: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (MoritzMuehlenhoff) Resolved→Open Contractors with a foo-ctr@wikimedia.org address should be in cn=wmf, not cn=nda.
[09:25:38] (CR) Marostegui: [C: +1] db-eqiad.php: Set pc1010 as pc2 primary. [mediawiki-config] - https://gerrit.wikimedia.org/r/697921 (https://phabricator.wikimedia.org/T282761) (owner: Kormat)
[09:37:26] !log upgrading eqiad to debmonitor-client 0.3.0 (along with deleting/recreating system user within 100-499 range) T235162
[09:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:31] T235162: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162
[09:38:47] !log Deploy schema change on s3 codfw master (with replication) - T282373 T282372 T282371
[09:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:53] T282372: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372
[09:38:54] T282373: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373
[09:38:55] T282371: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371
[09:41:09] (CR) Filippo Giunchedi: [C: +2] alertmanager: attach runbook/dashboard URLs to IRC messages [puppet] - https://gerrit.wikimedia.org/r/697721 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[09:41:12] (CR) Filippo Giunchedi: [C: +2] alertmanager: add a sample alert and test instructions [puppet] - https://gerrit.wikimedia.org/r/697722 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[09:41:14] (CR) Filippo Giunchedi: [C: +2] alerts: reload prometheus instances
after deploy [puppet] - 10https://gerrit.wikimedia.org/r/697737 (https://phabricator.wikimedia.org/T282806) (owner: 10Filippo Giunchedi) [09:45:38] PROBLEM - DPKG on an-master1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:46:39] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/691466 (owner: 10PipelineBot) [09:49:53] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 apparently disregarding nonmember addresses set to accept - https://phabricator.wikimedia.org/T284182 (10MarcoAurelio) >>! In T284182#7130997, @Ladsgroup wrote: > So to recap the problem is that you set to hold all WMF emails but overrode it per case using non-members... [09:58:04] (03PS2) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/696724 (owner: 10PipelineBot) [10:02:59] (03CR) 10David Caro: [C: 03+2] ceph: don't log to file as syslog works already [puppet] - 10https://gerrit.wikimedia.org/r/696330 (https://phabricator.wikimedia.org/T281247) (owner: 10David Caro) [10:08:22] 10SRE, 10observability, 10serviceops-radar, 10User-fgiunchedi: Prometheus PoPs disk space utilization - https://phabricator.wikimedia.org/T277163 (10fgiunchedi) The latter: AFAICS prometheus in esams is significantly larger than its counterparts in e.g. eqsin or ulsfo. IIRC this is due to the migration wor... [10:08:28] (03PS2) 10David Caro: ceph: add syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/695299 (https://phabricator.wikimedia.org/T281247) [10:08:44] (03CR) 10David Caro: [C: 03+2] ceph: add syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/695299 (https://phabricator.wikimedia.org/T281247) (owner: 10David Caro) [10:10:07] huh. no jouncebot [10:10:40] (03CR) 10Kormat: [C: 03+2] db-eqiad.php: Set pc1010 as pc2 primary. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/697921 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [10:11:30] jouncebot: now [10:11:36] (03Merged) 10jenkins-bot: db-eqiad.php: Set pc1010 as pc2 primary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697921 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [10:11:37] For the next 0 hour(s) and 48 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T1000) [10:13:53] !log kormat@deploy1002 Synchronized wmf-config/db-eqiad.php: Set pc1010 as pc2 primary T282761 (duration: 00m 58s) [10:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:04] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [10:16:00] RECOVERY - DPKG on an-master1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:18:35] (03PS1) 10Filippo Giunchedi: alertmanager: highlight 'instance' label in alerts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/697924 (https://phabricator.wikimedia.org/T282806) [10:19:34] (03PS1) 10Effie Mouzeli: mwdebug: fix values-pinkunicorn.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/697925 [10:19:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P16261 and previous config saved to /var/cache/conftool/dbconfig/20210603-101950-marostegui.json [10:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:29] (03CR) 10Filippo Giunchedi: "See preview at https://phabricator.wikimedia.org/F34479225" [puppet] - 10https://gerrit.wikimedia.org/r/697924 (https://phabricator.wikimedia.org/T282806) (owner: 10Filippo Giunchedi) [10:20:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: fix values-pinkunicorn.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/697925 (owner: 10Effie 
Mouzeli) [10:21:07] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc2008.codfw.wmnet,pc1008.eqiad.wmnet with reason: Purging parsercache T282761 [10:21:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc2008.codfw.wmnet,pc1008.eqiad.wmnet with reason: Purging parsercache T282761 [10:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:14] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [10:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:18] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: fix values-pinkunicorn.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/697925 (owner: 10Effie Mouzeli) [10:23:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16262 and previous config saved to /var/cache/conftool/dbconfig/20210603-102354-root.json [10:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:45] (03Merged) 10jenkins-bot: mwdebug: fix values-pinkunicorn.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/697925 (owner: 10Effie Mouzeli) [10:28:22] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:28:26] RECOVERY - DPKG on an-master1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:55] 10SRE, 10Patch-For-Review, 10User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10MoritzMuehlenhoff) >>! In T235162#6751520, @MoritzMuehlenhoff wrote: > Status update: The puppetised adduser.conf is rolled out, what remains is to fix some of t... 
[10:35:08] (03PS1) 10Urbanecm: lvwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191) [10:35:33] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/696724 (owner: 10PipelineBot) [10:37:00] (03PS1) 10David Caro: gitignore: add vscode temp files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697929 [10:37:03] (03PS1) 10David Caro: openstack.cloudvirt.{un}set_maintenance: use current host aggregates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 [10:38:30] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/696724 (owner: 10PipelineBot) [10:38:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16263 and previous config saved to /var/cache/conftool/dbconfig/20210603-103858-root.json [10:40:18] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:40:39] (03PS1) 10David Caro: Review access change [cookbooks] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/697822 [10:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:15] !log test librenms/AM paging [10:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:03] (03CR) 10Volans: Review access change (031 comment) [cookbooks] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/697822 (owner: 10David Caro) [10:42:29] there we go [10:47:38] (03PS1) 10Urbanecm: jawiki: extended confirmed should be 120 days since first edit, not registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697932 (https://phabricator.wikimedia.org/T284212) [10:51:37] (03CR) 10Zabe: [C: 03+1] jawiki: extended confirmed should be 120 days since first edit, not registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697932 (https://phabricator.wikimedia.org/T284212) (owner: 10Urbanecm) [10:52:43] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16264 and previous config saved to /var/cache/conftool/dbconfig/20210603-105402-root.json [10:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P16265 and previous config saved to /var/cache/conftool/dbconfig/20210603-105536-marostegui.json [10:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:35] jouncebot: next [10:59:35] In 0 hour(s) and 0 minute(s): EU Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T1100) [10:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16266 and previous config saved to /var/cache/conftool/dbconfig/20210603-105953-root.json [10:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, apergos, and duesen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) EU Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T1100). [11:00:09] (03CR) 10Klausman: [C: 03+1] sudo: drop keep_env option [puppet] - 10https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852) (owner: 10Jbond) [11:00:25] No one has signed up for the training in this backport window, and there are no patches listed for deployment. [11:00:48] As such, I will not join the google meet for the training today. [11:04:02] apergos: may i self-deploy sth? 
[11:04:49] (03PS1) 10David Caro: Review access change [cookbooks] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/697823 [11:05:28] (03CR) 10DannyS712: [C: 03+1] jawiki: extended confirmed should be 120 days since first edit, not registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697932 (https://phabricator.wikimedia.org/T284212) (owner: 10Urbanecm) [11:05:41] (03Abandoned) 10David Caro: Review access change [cookbooks] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/697822 (owner: 10David Caro) [11:07:31] (03CR) 10Urbanecm: [C: 03+2] jawiki: extended confirmed should be 120 days since first edit, not registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697932 (https://phabricator.wikimedia.org/T284212) (owner: 10Urbanecm) [11:07:34] going ahead [11:08:46] (03CR) 10Volans: [C: 03+1] "As agreed the process would be to try to keep the wmcs branch rebased on top of master for an easy of merge back to master later that woul" [cookbooks] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/697823 (owner: 10David Caro) [11:08:57] (03Merged) 10jenkins-bot: jawiki: extended confirmed should be 120 days since first edit, not registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697932 (https://phabricator.wikimedia.org/T284212) (owner: 10Urbanecm) [11:09:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repool db1179', diff saved to https://phabricator.wikimedia.org/P16267 and previous config saved to /var/cache/conftool/dbconfig/20210603-110906-root.json [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e84096857c8a2f753e077aa6c3e37b910b9e1fcd: jawiki: extended confirmed should be 120 days since first edit, not registration (T284212) (duration: 00m 58s) [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:34] T284212: request for 
modify jawiki extended confirmed user requirement - https://phabricator.wikimedia.org/T284212 [11:10:44] * urbanecm done [11:10:53] (03CR) 10David Caro: [V: 03+2 C: 03+2] Review access change [cookbooks] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/697823 (owner: 10David Caro) [11:11:54] sorry, I had already checked out :-D [11:12:06] you're listed as a deployer on the window, so you can ask yourself :-D [11:12:17] urbanecm: [11:12:28] (03PS1) 10Elukey: [WIP] - Add the operators.d directory with basic Istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) [11:12:34] not on the training one [11:12:39] anyway, i'm already done :D [11:12:47] are you not on this window? woops. I guess I looked at an earlier one [11:12:59] anyways fine :-D [11:13:18] yeah. I'm listed in all B&C windows, but not the training ones [11:13:43] ah gotcha. yeah I got it confused with an earlier one for which I looked at both patches, wondered why they were merged already, noted that they looked nice and easy... [11:13:50] then saw the window was already over :_D [11:14:10] (03PS2) 10David Caro: gitignore: add vscode temp files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697929 [11:14:18] (03PS2) 10David Caro: openstack.cloudvirt.{un}set_maintenance: use current host aggregates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 [11:14:43] (03CR) 10Elukey: "This is a very high level proposal about what I have been discussing with several people during the past days. 
It still misses the RBAC an" [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [11:14:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16268 and previous config saved to /var/cache/conftool/dbconfig/20210603-111456-root.json [11:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:36] (03PS2) 10Elukey: [WIP] - Add the operators.d directory with basic Istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) [11:18:26] (03CR) 10Elukey: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [11:22:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P16269 and previous config saved to /var/cache/conftool/dbconfig/20210603-112243-marostegui.json [11:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16270 and previous config saved to /var/cache/conftool/dbconfig/20210603-112620-root.json [11:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:39] 10SRE, 10Phabricator, 10Wikimedia-Bugzilla, 10Tracking-Neverending: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184 (10Aklapper) [11:30:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16271 and previous config saved to /var/cache/conftool/dbconfig/20210603-113000-root.json [11:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:00] 
10SRE, 10Patch-For-Review, 10User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10MoritzMuehlenhoff) >>! In T235162#7131437, @MoritzMuehlenhoff wrote: > Now that debmonitor is fixed, the final thing to do is to audit whether there are other sy... [11:33:45] (03PS1) 10Effie Mouzeli: mwdebug: more fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/697940 [11:41:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16272 and previous config saved to /var/cache/conftool/dbconfig/20210603-114124-root.json [11:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1157', diff saved to https://phabricator.wikimedia.org/P16273 and previous config saved to /var/cache/conftool/dbconfig/20210603-114325-marostegui.json [11:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16274 and previous config saved to /var/cache/conftool/dbconfig/20210603-114503-root.json [11:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P16275 and previous config saved to /var/cache/conftool/dbconfig/20210603-114731-root.json [11:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:18] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 249014968 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:53:46] !log installing curl security updates on 
stretch [11:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:08] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 500192 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:56:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16276 and previous config saved to /var/cache/conftool/dbconfig/20210603-115628-root.json [11:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P16277 and previous config saved to /var/cache/conftool/dbconfig/20210603-120235-root.json [12:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:42] !log installing lz4 security updates on buster [12:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:30] !log restarting FPM on mw canaries to pick up lz4 update [12:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:59] (known issue that the librenms page didn't notify here) [12:09:16] but should we worry about the issue? 
[12:09:34] yes [12:09:52] it is being discussed in -traffic, I'll ack the page [12:10:26] oh, I see the description on alertmanager was more hidden [12:10:48] (on the dashboard) [12:11:19] (03PS1) 10Ssingh: acme_chief: authorize doh300* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/697942 (https://phabricator.wikimedia.org/T252132) [12:11:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Repool db1166', diff saved to https://phabricator.wikimedia.org/P16278 and previous config saved to /var/cache/conftool/dbconfig/20210603-121133-root.json [12:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P16279 and previous config saved to /var/cache/conftool/dbconfig/20210603-121205-marostegui.json [12:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29785/console" [puppet] - 10https://gerrit.wikimedia.org/r/697942 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:13:52] yeah 'description' is meant to be sort of the "long" explanation and thus hidden by default [12:15:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16280 and previous config saved to /var/cache/conftool/dbconfig/20210603-121548-root.json [12:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:22] (03CR) 10JMeybohm: [C: 04-1] mwdebug: more fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697940 (owner: 10Effie Mouzeli) [12:17:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Repool db1157', diff saved to 
https://phabricator.wikimedia.org/P16281 and previous config saved to /var/cache/conftool/dbconfig/20210603-121739-root.json [12:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10aborrero) Hey, any updates on this? Can we help with anything, for example the switch config? [12:24:06] (03PS1) 10Filippo Giunchedi: alertmanager: cc -operations on IRC for all SRE pages [puppet] - 10https://gerrit.wikimedia.org/r/697943 (https://phabricator.wikimedia.org/T273716) [12:27:36] (03PS1) 10Ssingh: site: switch doh3001 and doh3002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/697944 [12:29:13] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10MoritzMuehlenhoff) >>! In T243057#7130650, @Dzahn wrote: > When we recreate the bastions without prometheus, we don't need to use 40GB disk anymore, right? They are inten... 
[12:30:45] (03PS2) 10Ssingh: site: switch doh3001 and doh3002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/697944 [12:30:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16282 and previous config saved to /var/cache/conftool/dbconfig/20210603-123052-root.json [12:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P16283 and previous config saved to /var/cache/conftool/dbconfig/20210603-123243-root.json [12:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:26] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10fgiunchedi) >>! In T243057#7131632, @MoritzMuehlenhoff wrote: >>>! In T243057#7130650, @Dzahn wrote: >> When we recreate the bastions without prometheus, we don't need to... [12:42:44] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) First error ` Memory Error Threshold Exceeded (Processor 1, DIMM 5) ` second error ` Uncorrectable Memory Error (Processor 1, DIMM 6) ` Third e... [12:44:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) Last update from Dell ` The Dell replacement part(s) for your POWEREDGE R440,ICE PE has been shipped by FEDX on tracking number 97796... 
[12:44:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) @RKemper you welcome [12:45:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16284 and previous config saved to /var/cache/conftool/dbconfig/20210603-124556-root.json [12:45:57] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) :-( Thank you, Papaul! [12:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:55] 10SRE: Misplaced file in python3-service-checker - https://phabricator.wikimedia.org/T284220 (10fgiunchedi) [12:56:15] 10SRE: Misplaced file in python3-service-checker - https://phabricator.wikimedia.org/T284220 (10fgiunchedi) [12:58:55] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:00:44] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:01:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P16285 and previous config saved to /var/cache/conftool/dbconfig/20210603-130059-root.json [13:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:58] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 113 
gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:16:46] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:17:45] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:54] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: authorize doh300* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/697942 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:19:00] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:59] (03CR) 10Ssingh: [V: 03+1 C: 03+2] acme_chief: authorize doh300* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/697942 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:23:10] (03CR) 10Ssingh: [C: 03+2] site: switch doh3001 and doh3002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/697944 (owner: 10Ssingh) [13:24:47] (03PS1) 10BBlack: Upload: limit bingbot media fetches a bit harder [puppet] - 10https://gerrit.wikimedia.org/r/697959 [13:25:51] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:36] (03PS2) 10BBlack: Upload: limit bingbot media fetches a bit harder [puppet] - 10https://gerrit.wikimedia.org/r/697959 [13:27:59] (03CR) 10Ema: [C: 03+1] Upload: limit bingbot media fetches a bit harder [puppet] - 10https://gerrit.wikimedia.org/r/697959 (owner: 10BBlack) [13:29:36] (03CR) 10BBlack: [C: 03+2] Upload: limit bingbot media fetches a bit harder [puppet] - 10https://gerrit.wikimedia.org/r/697959 (owner: 10BBlack) [13:35:43] 10SRE, 10Traffic, 10netops, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10cmooney) I fear we could be quite disappointed about "corporate workstations" being on IPv6 if we went to look ;) Either way I assume we want this li... [13:39:09] (03PS1) 10Giuseppe Lavagetto: prometheus exporters: use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697969 [13:40:23] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] prometheus exporters: use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697969 (owner: 10Giuseppe Lavagetto) [13:43:08] (03PS1) 10Cathal Mooney: Added doh3001 & doh3002 to Anycast peers in esams [homer/public] - 10https://gerrit.wikimedia.org/r/697970 (https://phabricator.wikimedia.org/T283503) [13:43:48] (03PS1) 10Ema: Upload: rate limit bingbot media fetches regardless of IP [puppet] - 10https://gerrit.wikimedia.org/r/697971 [13:44:44] (03PS2) 10Giuseppe Lavagetto: prometheus exporters: use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697969 [13:46:37] (03PS3) 10Giuseppe Lavagetto: prometheus exporters: use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697969 [13:48:57] (03PS2) 10Ema: Upload: block bingbot media fetches [puppet] - 10https://gerrit.wikimedia.org/r/697971 [13:50:42] (03CR) 10BBlack: [C: 03+1] Upload: block 
bingbot media fetches [puppet] - 10https://gerrit.wikimedia.org/r/697971 (owner: 10Ema) [13:50:47] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] prometheus exporters: use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697969 (owner: 10Giuseppe Lavagetto) [13:51:26] (03CR) 10Ema: [C: 03+2] Upload: block bingbot media fetches [puppet] - 10https://gerrit.wikimedia.org/r/697971 (owner: 10Ema) [13:52:43] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 158 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:54:43] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:54:44] the grafana link is phased out, fwiw ^^ [13:55:53] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:24] 10SRE, 10Analytics, 10LDAP-Access-Requests: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (10colewhite) 05Open→03Resolved Moved to group cn=wmf. [14:04:03] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron) Prometheus disk usage (106G) on prometheus3001 is larger than what can comfortably fit alongside the OS on the /dev/vda (128G) so a 150G /dev/vdb was added as /srv... 
[14:04:15] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron) [14:05:08] (03PS1) 10Ottomata: Add krb: present for htriedman [puppet] - 10https://gerrit.wikimedia.org/r/697974 (https://phabricator.wikimedia.org/T283368) [14:05:13] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (10Ottomata) @Htriedman I just created your Kerberos principal, you should receive an email asking you to log in and set your p... [14:05:58] jouncebot: next [14:05:58] In 1 hour(s) and 54 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T1600) [14:06:01] jouncebot: now [14:06:01] No deployments scheduled for the next 1 hour(s) and 53 minute(s) [14:11:16] (03CR) 10Ottomata: [C: 03+2] Add krb: present for htriedman [puppet] - 10https://gerrit.wikimedia.org/r/697974 (https://phabricator.wikimedia.org/T283368) (owner: 10Ottomata) [14:12:15] !log installing postgresql-9.6 security updates [14:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:19] 10SRE: Misplaced file in python3-service-checker - https://phabricator.wikimedia.org/T284220 (10colewhite) p:05Triage→03Low [14:17:54] (03CR) 10Muehlenhoff: [C: 03+2] Add nginx profile to apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/697739 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:18:52] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/697970 (https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [14:23:05] !log installing nginx security updates on buster [14:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:41] (03PS1) 10Muehlenhoff: Switch installserver::light to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697978 (https://phabricator.wikimedia.org/T164456) [14:33:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Looks okay to me, and the service container ought to be available by the time this callback is evaluated. But this should still be careful" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697851 (owner: 10Zabe) [14:37:18] (03CR) 10Cathal Mooney: [C: 03+2] Added doh3001 & doh3002 to Anycast peers in esams [homer/public] - 10https://gerrit.wikimedia.org/r/697970 (https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [14:38:37] (03Merged) 10jenkins-bot: Added doh3001 & doh3002 to Anycast peers in esams [homer/public] - 10https://gerrit.wikimedia.org/r/697970 (https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [14:39:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/697978 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:43:22] (03PS2) 10Muehlenhoff: Switch installserver::light to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697978 (https://phabricator.wikimedia.org/T164456) [14:45:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/697978 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:54:13] !log Gerrit 697970: Add Wikidough BGP peerings on esams CRs for doh3001 and doh3002. 
[14:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:22] 10SRE, 10Analytics, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10Ottomata) Ok, for the kafka term, we no longer need any logstash hosts. kafka logging cluster used to be colocated on a few logstash hosts, but no longer, they are all on kafka-loggingXXXX. This [[ ht... [14:55:40] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:35] (03PS1) 10Muehlenhoff: Switch install* servers to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/697987 (https://phabricator.wikimedia.org/T164456) [15:00:03] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:13] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:26] topranks: ^ can this be related to the wikidough change? [15:01:27] oh yeah [15:01:44] 10SRE, 10Traffic, 10netops, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10cmooney) Peerings to doh3001 and doh3002 added on cr1-esams and cr2-esams now. Anycast range is being announced and from here in Ireland I'm hitting doh3... 
[15:02:29] sukhe: looking [15:02:58] (03PS1) 10Muehlenhoff: Switch testreduce to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) [15:03:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [15:03:26] (03CR) 10jerkins-bot: [V: 04-1] Switch testreduce to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [15:08:39] !log disconnect ps2-d8-codfw for replacement [15:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:03] PROBLEM - Host theemin.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:25] whoa, we have a theremin in pro-- oh I see [15:14:00] (03PS2) 10Muehlenhoff: Switch testreduce to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) [15:15:25] PROBLEM - Juniper alarms on asw-d-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:16:11] PROBLEM - Host ps-test-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:17:15] RECOVERY - Juniper alarms on asw-d-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:17:33] (03CR) 10Wolfgang Kandek: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [15:18:53] (03PS1) 10RobH: moss-be100[12] setup info [puppet] - 10https://gerrit.wikimedia.org/r/697990 (https://phabricator.wikimedia.org/T276637) [15:19:33] (03CR) 10David Caro: [C: 03+2] gitignore: add vscode temp files [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697929 (owner: 10David Caro) [15:19:53] (03CR) 10RobH: [C: 03+2] moss-be100[12] setup info [puppet] - 
10https://gerrit.wikimedia.org/r/697990 (https://phabricator.wikimedia.org/T276637) (owner: 10RobH) [15:21:57] (03CR) 10David Caro: [C: 04-1] openstack.cloudvirt.{un}set_maintenance: use current host aggregates (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 (owner: 10David Caro) [15:22:28] RECOVERY - Host theemin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms [15:22:45] RECOVERY - Host ps-test-d8-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms [15:25:55] !log upgrading gitlab to 13.11.5 [15:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:18] !log pdu replacement complete [15:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['moss-be1001.eqiad.wmnet', 'moss-be1002.eqiad.wmnet'] ` T... [15:30:31] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10RobH) [15:34:07] (03PS1) 10Ottomata: dumps-eqiad-analytics_meta.sql.erb - add grants for new airflow_analytics database [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) [15:35:37] (03CR) 10jerkins-bot: [V: 04-1] dumps-eqiad-analytics_meta.sql.erb - add grants for new airflow_analytics database [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:35:53] (03CR) 10Ottomata: "Hiya," [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:36:31] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10RobH) >>! 
In T276637#7132069, @ops-monitoring-bot wrote: > Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: > ` > ['moss-be1001.eqiad.wmnet', 'moss-be1... [15:36:39] (03PS2) 10Ottomata: dumps-eqiad-analytics_meta.sql.erb - add grants for new airflow_analytics [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) [15:37:42] (03PS1) 10Cathal Mooney: Correct IP address for doh3002 BGP peer in esams [homer/public] - 10https://gerrit.wikimedia.org/r/697993 (https://phabricator.wikimedia.org/T283503) [15:39:19] (03PS3) 10David Caro: openstack.cloudvirt.{un}set_maintenance: use current host aggregates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 [15:43:56] (03CR) 10Ayounsi: [C: 03+1] Correct IP address for doh3002 BGP peer in esams [homer/public] - 10https://gerrit.wikimedia.org/r/697993 (https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [15:44:16] (03PS1) 10Cwhite: kafka-logging: reduce retention time to 5 days [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) [15:45:04] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [15:46:09] (03CR) 10Cathal Mooney: [C: 03+2] Correct IP address for doh3002 BGP peer in esams [homer/public] - 10https://gerrit.wikimedia.org/r/697993 (https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [15:46:21] (03CR) 10Herron: [C: 03+1] New upstream release [debs/karma] - 10https://gerrit.wikimedia.org/r/697916 (owner: 10Filippo Giunchedi) [15:46:49] (03CR) 10Herron: [C: 03+1] alertmanager: highlight 'instance' label in alerts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/697924 (https://phabricator.wikimedia.org/T282806) (owner: 10Filippo Giunchedi) [15:46:51] (03Merged) 10jenkins-bot: Correct IP address for doh3002 BGP peer in esams [homer/public] - 10https://gerrit.wikimedia.org/r/697993 
(https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [15:47:47] (03PS4) 10Ottomata: Set up airflow-analytics on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/697653 (https://phabricator.wikimedia.org/T272973) [15:49:07] !log Gerrit 697993: Change BGP peer IP for doh3002 on esams CRs. [15:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:11] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29787/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/697653 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:50:18] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1001/29786/" [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) (owner: 10Cwhite) [15:50:26] (03CR) 10Jcrespo: [C: 03+1] "This is ok to me- as long as this is registered on puppet, from my point of view analytics fully own this and can deploy at any time. 
You " [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:50:42] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 428, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:54] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:05] ^ topranks: nice :) [15:51:15] (03CR) 10Ottomata: [C: 03+2] Set up airflow-analytics on an-launcher1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697653 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:52:32] 10SRE, 10Traffic, 10netops, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10cmooney) There was an issue with peering to doh3002 due to a problem that occurred with Netbox automation, triggered by the VM creation running twice I be... [15:53:40] (03CR) 10Herron: [C: 03+1] kafka-logging: reduce retention time to 5 days [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) (owner: 10Cwhite) [15:54:29] (03CR) 10Herron: [C: 03+1] alertmanager: cc -operations on IRC for all SRE pages [puppet] - 10https://gerrit.wikimedia.org/r/697943 (https://phabricator.wikimedia.org/T273716) (owner: 10Filippo Giunchedi) [15:54:58] (03CR) 10Jcrespo: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:55:53] (03CR) 10Elukey: [C: 03+1] "LGTM, left a nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) (owner: 10Cwhite) [16:00:04] jbond42 and cdanis: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T1600). [16:05:24] (03CR) 10Elukey: [C: 03+1] "One note - IIRC there is also the possibility of reducing the retention per topic via the kafka cli, if we just want to get some more free" [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) (owner: 10Cwhite) [16:06:27] (03CR) 10Filippo Giunchedi: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) (owner: 10Cwhite) [16:06:32] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: reduce retention time to 5 days [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) (owner: 10Cwhite) [16:07:31] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) (owner: 10DannyS712) [16:09:55] (03CR) 10Zabe: [C: 03+1] Add 2021 namespaces for wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) (owner: 10DannyS712) [16:10:52] (03CR) 10Urbanecm: [C: 03+1] Add 2021 namespaces for wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) (owner: 10DannyS712) [16:14:04] (03CR) 10MSantos: [C: 03+2] Use consistent binary name for tegola builds [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/697911 (owner: 10Jgiannelos) [16:15:02] (03PS1) 10Ottomata: airflow - Expose admin details by default [puppet] - 10https://gerrit.wikimedia.org/r/697999 (https://phabricator.wikimedia.org/T272973) [16:15:21] (03Merged) 10jenkins-bot: Use consistent binary name for tegola builds [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/697911 (owner: 10Jgiannelos) [16:15:34] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 142 gt 100 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:15:41] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) > debug1: Executing proxy command: exec ssh -a -W stat1006.ulsfo.wmnet:22 janstee@bast4003.wikimedia.org "stat1006.ulsfo.wmnet" in there caught my eye.... [16:17:11] (03CR) 10Majavah: "> Patch Set 7:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [16:17:22] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:22:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:22:31] XioNoX: did you run the cookbook after deleting the IP? 
^^^ [16:22:41] volans: of course not :) [16:22:43] doing now [16:23:02] thanks, sorry :) [16:23:07] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [16:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:23] (03CR) 10Bstorm: "> Patch Set 7:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [16:29:38] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4444 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:32:25] ^ looks real, and api_appserver latency is spiking too, I'm still in this training but poking around a little [16:36:50] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:44:39] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) Thank you both -- @elukey I thought I did, but nothing worked. I am hoping the fix needed that @dzahn spotted will work, I will try that and report... 
[16:46:06] mutante: nice parser --^ :) [16:46:12] I totally missed it [16:46:23] the password thing threw me off [16:46:55] (03CR) 10Ottomata: [C: 03+2] airflow - Expose admin details by default [puppet] - 10https://gerrit.wikimedia.org/r/697999 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [16:50:19] (03PS1) 10Papaul: ADD new Raritan test PDU [puppet] - 10https://gerrit.wikimedia.org/r/698007 (https://phabricator.wikimedia.org/T265435) [16:51:20] (03CR) 10Papaul: [C: 03+2] ADD new Raritan test PDU [puppet] - 10https://gerrit.wikimedia.org/r/698007 (https://phabricator.wikimedia.org/T265435) (owner: 10Papaul) [16:59:33] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, and 2 others: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi new Raritan test PDU is racked in D8 https://netbox.wikimedia.org/dcim/devices/3427/ still working on librenms https://librenms.wikimedia.org/devi... [17:00:05] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T1700). [17:10:46] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['moss-be1001.eqiad.wmnet', 'moss-be1002.eqiad.wmnet'] ` The log can be found in... 
[17:13:05] (03CR) 10Bodhisattwa: [C: 03+1] Add 2021 namespaces for wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) (owner: 10DannyS712) [17:16:06] !log remove dropped Cassandra keyspace snapshots -- T258414 [17:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:10] T258414: Cassandra Grafana dashboards seem to disagree with actual utilization - https://phabricator.wikimedia.org/T258414 [17:17:09] (03PS1) 10Majavah: toolforge: prometheus: renew k8s TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/698009 (https://phabricator.wikimedia.org/T280301) [17:20:25] (03CR) 10Wolfgang Kandek: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [17:25:52] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5238 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:26:04] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be1001.eqiad.wmnet', 'moss-be1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['moss-be1001.eqiad.wmnet', 'moss-be1002.... [17:27:44] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10RobH) ` 17:23:22 | moss-be1001.eqiad.wmnet | Unable to run wmf-auto-reimage-host: The host moss-be1001.eqiad.wmnet should have rebooted into the newly installed Operating System but a... 
[17:27:53] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [17:31:42] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['moss-be1001.eqiad.wmnet', 'moss-be1002.eqiad.wmnet'] ` The log can be found in... [17:32:33] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [17:36:54] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:37:38] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Cmjohnson) 05Open→03Resolved @ema Replaced the DIMM A6, powered on and replacement recognized. Message PR1: Replaced part detected for device: DDR4 DIMM(Socket A6). Booted to the OS Cleared th... 
[17:37:47] !log gitlab1001: re-running install-gitlab-server.sh [17:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:06] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=cache_text file=device_smart.prom instance=cp1087 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:41:34] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03175 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:45:25] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1001.eqiad.wmnet with reason: REIMAGE [17:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:44] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer={backend,frontend} site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [17:47:31] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1002.eqiad.wmnet with reason: REIMAGE [17:47:33] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1001.eqiad.wmnet with reason: REIMAGE [17:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:01] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [17:49:48] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1002.eqiad.wmnet with reason: REIMAGE [17:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:38] RECOVERY - HP RAID on ms-be1053 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:51:53] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (10Cmjohnson) 05Open→03Resolved replaced the disk [17:57:08] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1087 is CRITICAL: 2.726e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [17:57:28] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:58:07] cp1087 was just booted (was down for maintenance) [17:58:16] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['moss-be1001.eqiad.wmnet', 'moss-be1002.eqiad.wmnet'] ` and were **ALL** successful. [17:59:18] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1087 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [17:59:51] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (10Htriedman) 05Resolved→03Open Hi all - reopening this task so that I can get access to https://superset.wikimedia.org as a Hive GUI. Tagging @... [18:03:06] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:03:37] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (10elukey) Added the user to the `wmf` LDAP group. @Htriedman can you retry now? You are probably going to check https://superset.wikimedia.org/super... [18:04:01] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/29788/install1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/697978 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [18:04:06] elukey: <3 ^ thank you [18:04:18] (for handling access r) [18:05:04] ottomata: I didn't mean to step on your toes, you can blame Reedy :D [18:05:43] haha [18:07:14] (03CR) 10Dzahn: "deployed. noop on install1003, install3001, install2003" [puppet] - 10https://gerrit.wikimedia.org/r/697978 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [18:08:52] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production analytics data and cluster for htriedman - https://phabricator.wikimedia.org/T283368 (10Htriedman) 05Open→03Resolved It's working perfectly! Thanks so much for the responsiveness. 
[18:09:33] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/29789/install1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/697987 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [18:09:40] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10ssingh) [18:09:50] mutante: hi! wanted to talk about this :D ^ [18:10:15] sukhe: :) yes, happy to do that. no problem [18:10:16] first, I wanted to thank you for doing this, every time! I wanted to mention that the reason I don't do it myself is to not break protocol [18:10:30] however, please don't treat this as urgent since we already have deployments on two clusters :) [18:10:45] but again, thanks very much! [18:11:21] you're welcome. it's great you are doing it this way [18:11:50] do you want to stick to the 30G even though esams has 10G now? [18:15:04] yeah that's fair. I think we can do 10G, let's try conserving space and see how it works; our disk requirements are not going to change anyway [18:15:18] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10ssingh) [18:16:11] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@659a8e4]: resolve npe in datawriter [18:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:26] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@659a8e4]: resolve npe in datawriter (duration: 00m 15s) [18:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:31] sukhe: If we had enough resources and would start over I would have said 20G.. 
just thought consistency made more sense one way or the other [18:19:34] yeah :) I would have also probably called the hosts "dough" and not "doh" as we already confused them with "dns" once today but too late :D [18:20:40] sukhe: https://en.wikipedia.org/wiki/D%27oh%21 [18:20:57] that's what I just heard internally [18:22:08] haha yeah, so at least we have that going for us [18:22:18] Homer was required to utter what was written in the script as an "annoyed grunt".[6] Dan Castellaneta rendered it as a drawn out "d'ooooooh". [18:25:52] (03PS1) 10Ssingh: site: add wikidough eqsin with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/698013 [18:28:14] (03PS1) 10Ssingh: acme_chief: authorize doh500* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) [18:28:22] !log temp. disabling puppet on install* servers. switching nginx to light variant (T164456) [18:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:26] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [18:28:46] (03PS2) 10Ssingh: site: add wikidough eqsin with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/698013 (https://phabricator.wikimedia.org/T284246) [18:29:38] PROBLEM - SSH on cp1087 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:29:53] ^ yea, we know that is broken or something [18:30:28] (03CR) 10Nikki Nikkhoui: "Open to any suggestions/thoughts on where i went wrong" [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [18:31:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29790/console" [puppet] - 10https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) (owner: 10Ssingh) [18:31:54] RECOVERY - SSH on cp1087 is OK: SSH OK - OpenSSH_7.9p1 
Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:34:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:34:20] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Dzahn) 05Resolved→03Open a:05Cmjohnson→03ema https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp1087 [18:36:42] (03CR) 10Bstorm: [C: 03+2] toolforge: prometheus: renew k8s TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/698009 (https://phabricator.wikimedia.org/T280301) (owner: 10Majavah) [18:36:44] (03CR) 10Dzahn: [C: 03+1] acme_chief: authorize doh500* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) (owner: 10Ssingh) [18:37:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp1087.eqiad.wmnet with reason: replaced DIMM https://phabricator.wikimedia.org/T278729 [18:37:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp1087.eqiad.wmnet with reason: replaced DIMM https://phabricator.wikimedia.org/T278729 [18:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] !log [WDQS] `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph` (blazegraph on the host has been locked up for ~16 hours based off of https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1622683465757&to=1622745461547) [18:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:45] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Dzahn) END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on cp1087.eqiad.wmnet with reason: replaced DIMM https://phabricator.wikimedia.org/T278729 [18:38:49] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) @elukey the correction to equid host did not resolve the problem and terminal continues to ask for the passphrase. Should I just delete the existing... [18:39:53] !log [WDQS] depooled `wdqs1012` (has ~15 hours of lag to catch up on) [18:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:02] (03CR) 10Dzahn: [C: 03+2] site: add wikidough eqsin with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/698013 (https://phabricator.wikimedia.org/T284246) (owner: 10Ssingh) [18:41:01] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "disabled puppet on install* via cumin, merging.." [puppet] - 10https://gerrit.wikimedia.org/r/697987 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [18:42:13] any OS installs going on? brief moment of maintenance possible on install* nginx, brb [18:46:18] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1005.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `wdqs_reimage` [18:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:22] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [18:46:24] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2005.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage` [18:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:42] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) Would you mind pasting the contents of ` /Users/janstee/.ssh/config`, Jaime? One more thing to try is, try to SSH just to the bastion host directly, and... 
[18:51:30] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@f40d41a]: resolve npe in datawriter [18:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:49] PROBLEM - WDQS high update lag on wdqs1012 is CRITICAL: 5.409e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:52:02] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@f40d41a]: resolve npe in datawriter (duration: 00m 31s) [18:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:15] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1012 is CRITICAL: 5.398e+04 ge 4.32e+04 Ryan Kemper host already depooled needs to catch up on 15 hours of lag https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:57:53] Request from - via cp3054.esams.wmnet, ATS/8.0.8 [18:57:53] Error: 408, Inactive Timeout at 2021-06-03 18:53:28 GMT [18:58:07] When I tried to save a page. [19:00:15] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3175 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:00:58] (03PS1) 10Andrew Bogott: Trove config: add some configs for redis and postgres [puppet] - 10https://gerrit.wikimedia.org/r/698017 (https://phabricator.wikimedia.org/T212595) [19:01:00] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1005.eqiad.wmnet with reason: REIMAGE [19:01:03] ShakespeareFan00: is that failure repeatable, or a "blip" in one interaction? 
[19:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:03] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) ` 440601 Jun 2 23:05:38 bast4003 sshd[6968]: Accepted key ED25519 SHA256:plaVmNDA1Ug/00RQCUV2WfIKRDNwP7GLq9NouyMKMJM found at /etc/ssh/userkeys/janstee:1... [19:02:33] a blip [19:02:39] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:03:04] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2005.codfw.wmnet with reason: REIMAGE [19:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:08] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1005.eqiad.wmnet with reason: REIMAGE [19:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:34] ShakespeareFan00: ack. 
worth keeping our ears open for more reports then, but probably not easy to dig into the cause of yet [19:04:37] (03CR) 10Andrew Bogott: [C: 03+2] Trove config: add some configs for redis and postgres [puppet] - 10https://gerrit.wikimedia.org/r/698017 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [19:05:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2005.codfw.wmnet with reason: REIMAGE [19:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:45] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10RobH) [19:06:28] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10RobH) 05Open→03Resolved @fgiunchedi these are now ready for your use! [19:14:36] !log install1003 - restarting nginx after we switched from nginx-full to nginx-light package, same on other install servers T164456 [19:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:42] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [19:19:21] (03PS1) 10Jbond: add .gitreview file [software/netbox] - 10https://gerrit.wikimedia.org/r/698020 [19:22:13] 10SRE, 10Traffic, 10Patch-For-Review, 10User-ArielGlenn, and 2 others: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10Dzahn) checked on install* that nginx-full is gone, nginx-light is there and restarted nginx to be sure this did not remove other nginx-* module packages though [19:23:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh5001.wikimedia.org [19:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:22] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:27:39] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1013.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `wdqs_reimage` [19:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:42] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [19:27:49] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [19:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:00] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2005.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage` [19:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:51] ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service daniel_zahn https://phabricator.wikimedia.org/T251918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:21] 10SRE, 10serviceops, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10Dzahn) 2021-06-03 19:19:30 3d 2h 47m 35s 3/3 CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service [19:32:11] !log [deneb:~] $ sudo systemctl start docker-reporter-releng-images [19:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:55] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [19:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:17] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:19] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [19:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:28] !log [deneb:~] $ sudo systemctl start docker-reporter-releng-images - T251918 - icinga-wm> RECOVERY - Check systemd state on deneb is OK [19:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:32] T251918: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 [19:34:13] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:44] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@339d402]: ship pip and wheel packages for virtualenvs [19:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:57] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh5001.wikimedia.org [19:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:12] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@339d402]: ship pip and wheel packages for virtualenvs (duration: 04m 27s) [19:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:52] 10SRE, 10serviceops, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10Dzahn) May 31 16:02:36 deneb docker-report-releng[31493]: ERROR[docker-report] Debmonitor report for image 
docker-registry.wikimedia.org/releng/quibble-jessie-php56:0.0.31-1 f... [19:40:16] (03PS1) 10Andrew Bogott: trove.conf: rename 'postgres' to 'postgresql' [puppet] - 10https://gerrit.wikimedia.org/r/698022 [19:41:06] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh5002.wikimedia.org [19:41:06] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh5002.wikimedia.org [19:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:10] (03CR) 10Andrew Bogott: [C: 03+2] trove.conf: rename 'postgres' to 'postgresql' [puppet] - 10https://gerrit.wikimedia.org/r/698022 (owner: 10Andrew Bogott) [19:45:46] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) doh5001 has been created but doh5002 hit resource limits here as well, even though we just used 10G disk, it is maybe another resource: ` dzahn@cu... [19:46:35] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3492 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:48:53] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:49:25] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) @ssingh @BBlack Our issue over here is lack of the resource of .. public IPs, it looks: ` 13729 File "/usr/lib/python3/dist-packages/spicerack/_... 
[19:53:30] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) {F34479670} [19:56:32] !log [mwmaint1002:~] $ sudo systemctl start daily_account_consistency_check.service [19:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:45] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.381 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:58:15] !log [mwmaint1002:~] $ /usr/local/bin/systemd-timer-mail-wrapper -T root@mwmaint1002.eqiad.wmnet --only-on-error /usr/local/bin/cross-validate-accounts [19:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:43] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:18] (03PS1) 10Ebernhardson: Undeploy mjolnir profile from analytics [puppet] - 10https://gerrit.wikimedia.org/r/698025 (https://phabricator.wikimedia.org/T265547) [20:05:53] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: daily_account_consistency_check.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:07] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3492 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:09:03] (03PS1) 10Dzahn: DHCP: add doh5001 MAC, add doh[2345] to partman regex [puppet] - 10https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) [20:09:19] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add doh5001 MAC, add doh[2345] to partman regex [puppet] - 
10https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) (owner: 10Dzahn) [20:09:48] (03PS2) 10Dzahn: DHCP: add doh5001 MAC, add doh[2345] to partman regex [puppet] - 10https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) [20:15:14] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10Dzahn) ACK. thanks all. Several people have mentioned the part that we have metal there this could move to. It seems to make sense to me to move prometheus to that and out... [20:15:59] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) @dzahn Here is my config file paste: Host * ForwardAgent no IdentitiesOnly yes Host * AddKeysToAgent yes UseKeychain yes # From https... [20:17:15] 10SRE, 10observability, 10serviceops-radar, 10User-fgiunchedi: Prometheus PoPs disk space utilization - https://phabricator.wikimedia.org/T277163 (10Dzahn) ACK, thanks, @fgiunchedi As commented on the linked ticket, multiple people have mentioned we have metal there that prometheus could move to. Seems to... 
[20:19:09] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:20:23] (03PS1) 10Andrew Bogott: Trove: add support config for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/698051 (https://phabricator.wikimedia.org/T212595) [20:21:11] (03CR) 10Andrew Bogott: [C: 03+2] Trove: add support config for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/698051 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [20:24:25] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:27:58] (03CR) 10Brennen Bearnes: [C: 03+1] "Think we're ready for this." [puppet] - 10https://gerrit.wikimedia.org/r/696024 (https://phabricator.wikimedia.org/T276144) (owner: 10Jbond) [20:31:40] (03CR) 10Muehlenhoff: DHCP: add doh5001 MAC, add doh[2345] to partman regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) (owner: 10Dzahn) [20:32:51] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) Ok, thank you. Hmm.. Let's try this: Comment out or temp. remove these lines: ` > Host * > AddKeysToAgent yes > UseKeychain yes ` and then `ssh -i... 
[20:33:24] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) 05Resolved→03Open [20:33:57] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) If it turns out you need to make new keys, just paste the public part here and ask us to update the repository. [20:34:01] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [20:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:09] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2005.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage` [20:34:10] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@1c40c83]: bulk daemon: accept events for search_updates swift container [20:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:12] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [20:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:24] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:32] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1013.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage` [20:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:10] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@1c40c83]: bulk daemon: accept events for search_updates swift container 
(duration: 01m 00s) [20:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:02] !log restart mjolnir-kafka-bulk-daemon on search-loader[12]001 [20:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:16] (03CR) 10Dzahn: DHCP: add doh5001 MAC, add doh[2345] to partman regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) (owner: 10Dzahn) [20:39:35] (03CR) 10Ebernhardson: "This should be all ready to go. The code access mjolnir through python packages has been deployed to analytics, and the search/mjolnir/dep" [puppet] - 10https://gerrit.wikimedia.org/r/698025 (https://phabricator.wikimedia.org/T265547) (owner: 10Ebernhardson) [20:41:30] (03PS3) 10Dzahn: DHCP: add doh5001 MAC, add doh[2345] to partman regex [puppet] - 10https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) [20:42:38] 10SRE, 10Okapi [Wikimedia Enterprise], 10Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10RBrounley_WMF) Checking in here @Eugene.chernov, any blockers? [20:42:51] Hey all - was going to deploy the security patch for T282932 unless anybody has any objections. [20:46:53] (03CR) 10Dzahn: DHCP: add doh5001 MAC, add doh[2345] to partman regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) (owner: 10Dzahn) [20:47:34] !log Deployed security patch for T282932 [20:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:27] 10SRE, 10Okapi [Wikimedia Enterprise], 10Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10BBlack) @RBrounley_WMF I think he's waiting on me, sorry! 
Will sync up with him [20:51:39] (03CR) 10Cwhite: [C: 03+2] kafka-logging: reduce retention time to 5 days [puppet] - 10https://gerrit.wikimedia.org/r/697995 (https://phabricator.wikimedia.org/T284233) (owner: 10Cwhite) [20:54:08] !log restart kafka on kafka-logging to take new retention config [20:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 74 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:00:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 42 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:10:47] (03CR) 10Bstorm: [C: 03+2] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:11:22] (03CR) 10Bstorm: "Hrm. Need to deal with the parent before merge." [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:12:19] (03CR) 10Bstorm: [C: 03+1] "I'll merge it after the parent is merged (or anyone else with +2 can). 😊" [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:20:08] (03CR) 10Ladsgroup: "> Patch Set 2: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:28:55] (03CR) 10Bstorm: [C: 04-1] "This makes me nervous. 
There are years of of configuration in hiera that no not overlap with cloud that would suddenly overlap heavily, so" [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) [21:30:01] (03CR) 10Bstorm: [C: 04-1] "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) [21:39:56] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) I created new keys, please use the following to update the repository: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKIp5RxtQOU35h+P/B+MgpSarZJnr73c8aIMBGEa... [21:42:22] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:42:35] (03PS3) 10Bstorm: dumps: Migrate rsync of nginxlogs from cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:44:12] (03CR) 10Bstorm: [C: 03+2] dumps: Migrate rsync of nginxlogs from cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:47:12] (03PS8) 10Jbond: Add CAS authentication support [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [21:47:25] volans: ^^ [21:47:46] * jbond will test it tomorrow [21:48:01] (03CR) 10Ladsgroup: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/697130 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:51:27] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:15] jbond: go to bed! 
[21:55:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:03] * jbond grabs his blanky ang gose to bed [21:56:47] (03CR) 10Volans: [C: 03+1] "LGTM. On a side note I think we should revisit how we manage this repository. IIRC it was used to rebase our patches on top of upstream, a" [software/netbox] - 10https://gerrit.wikimedia.org/r/698020 (owner: 10Jbond) [21:57:59] PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 4699 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:04:38] (03PS1) 10Bstorm: galera: ensure that mariabackup is also installed [puppet] - 10https://gerrit.wikimedia.org/r/698064 (https://phabricator.wikimedia.org/T284157) [22:15:31] !log robh@cumin1001 START - Cookbook sre.dns.netbox [22:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:29] RECOVERY - WDQS high update lag on wdqs1012 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.156e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:22:04] 10SRE, 10Okapi [Wikimedia Enterprise], 10Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10RBrounley_WMF) No problem, thanks @BBlack! 
[22:22:17] RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 1184 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:28:19] 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10colewhite) [22:28:26] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:55] (03PS1) 10Razzi: kerberos: add krb: present for jdl [puppet] - 10https://gerrit.wikimedia.org/r/698067 (https://phabricator.wikimedia.org/T284081) [22:32:01] 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10colewhite) Hi @LZaman! I've updated the task description with the information we need to proceed. Please confirm that the wikitech username is correct, fill out the "Reason for access" secti... [22:32:13] 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10colewhite) p:05Triage→03Medium [22:32:56] (03PS1) 10Razzi: kerberos: add krb: present for phuedx [puppet] - 10https://gerrit.wikimedia.org/r/698068 (https://phabricator.wikimedia.org/T284096) [22:34:59] (03CR) 10Razzi: [C: 03+2] kerberos: add krb: present for phuedx [puppet] - 10https://gerrit.wikimedia.org/r/698068 (https://phabricator.wikimedia.org/T284096) (owner: 10Razzi) [22:35:13] (03PS2) 10Razzi: kerberos: add krb: present for jdl [puppet] - 10https://gerrit.wikimedia.org/r/698067 (https://phabricator.wikimedia.org/T284081) [22:35:15] !log T280382 `wdqs2005.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. 
`df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 998G 1.5T 40% /srv` [22:35:18] (03CR) 10Bstorm: Add WMCS specific cloud role for syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [22:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:19] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [22:36:09] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:19] (03CR) 10Razzi: [C: 03+2] kerberos: add krb: present for jdl [puppet] - 10https://gerrit.wikimedia.org/r/698067 (https://phabricator.wikimedia.org/T284081) (owner: 10Razzi) [22:36:43] !log T280382 Cancelled transfer to `wdqs1005`; the source host `wdqs1013` has a `wikidata.jnl` that is 80% too big; will transfer from different node -> `wdqs1005` and then fix the journal on `wdqs1013` after [22:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:08] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (10colewhite) [22:39:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:41] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1008.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage` [22:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:53] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (10colewhite) 
p:05Triage→03Medium Hi @BVershbow_WMF! I've added the information we need to proceed in the description. Would you please fill out any missing information and confirm tha... [22:40:10] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (10colewhite) [22:41:13] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2001.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage` [22:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:17] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [22:41:39] 10SRE, 10Wikimedia-Mailing-lists: Add link to list archives in default footer - https://phabricator.wikimedia.org/T284256 (10colewhite) p:05Triage→03Medium [22:42:18] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10colewhite) p:05Triage→03Medium [22:44:45] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [22:45:09] (03PS1) 10Ryan Kemper: wdqs-internal: lower depool threshold to .3 [puppet] - 10https://gerrit.wikimedia.org/r/698069 (https://phabricator.wikimedia.org/T284264) [22:45:10] (03PS2) 10Zabe: Restrict changetags to sysops and bots on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694686 (https://phabricator.wikimedia.org/T283625) [22:46:07] (03PS3) 10Zabe: Restrict changetags to sysops and bots on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694686 (https://phabricator.wikimedia.org/T283625) [22:47:05] (03CR) 
Ryan Kemper: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29791/console" [puppet] - https://gerrit.wikimedia.org/r/698069 (https://phabricator.wikimedia.org/T284264) (owner: Ryan Kemper) [22:51:09] (PS4) Zabe: Restrict changetags to sysops and bots on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/694686 (https://phabricator.wikimedia.org/T283625) [23:00:04] brennen: (Dis)respected human, time to deploy US Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T2300). Please do the needful. [23:00:04] DannyS712 and Zabe: A patch you scheduled for US Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] * thcipriani waves [23:00:16] o/ [23:00:25] (PS1) Dzahn: static-bugzilla: add config to serve compressed HTML [container/miscweb] - https://gerrit.wikimedia.org/r/698070 [23:00:56] hi zabe [23:00:58] DannyS712: around for backport? 
[23:01:02] (CR) Dzahn: [C: +2] DHCP: add doh5001 MAC, add doh[2345] to partman regex [puppet] - https://gerrit.wikimedia.org/r/698047 (https://phabricator.wikimedia.org/T284246) (owner: Dzahn) [23:01:10] hi [23:02:08] (here) [23:02:45] (CR) Thcipriani: [C: +2] "Config backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/694686 (https://phabricator.wikimedia.org/T283625) (owner: Zabe) [23:03:34] (Merged) jenkins-bot: Restrict changetags to sysops and bots on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/694686 (https://phabricator.wikimedia.org/T283625) (owner: Zabe) [23:04:02] SRE, LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (colewhite) [23:04:04] sukhe: btw, considered bullseye?:) [23:05:22] zabe: if you can check your change it's live on mwdebug1001 [23:05:38] doing [23:06:34] thcipriani: works the supposed way [23:06:39] SRE, LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (colewhite) @KFrancis can you confirm an NDA on file for @Cervisiarius? 
[23:06:54] zabe: cool, going live now [23:07:15] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (Dzahn) a:Dzahn [23:09:03] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:694686|Restrict changetags to sysops and bots on meta]] T283625 (duration: 00m 58s) [23:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:07] T283625: Revoke changetags permissions from the 'Users' group - https://phabricator.wikimedia.org/T283625 [23:09:08] ^ zabe live now [23:09:18] thanks :) [23:09:41] thanks for the patch [23:11:23] DannyS712: 2nd ping for backport window [23:11:56] (CR) Cwhite: [C: +1] alertmanager: cc -operations on IRC for all SRE pages [puppet] - https://gerrit.wikimedia.org/r/697943 (https://phabricator.wikimedia.org/T273716) (owner: Filippo Giunchedi) [23:13:22] (CR) Cwhite: [C: +1] alertmanager: highlight 'instance' label in alerts dashboard [puppet] - https://gerrit.wikimedia.org/r/697924 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi) [23:13:24] (CR) Dzahn: [C: +1] "seems a good intention and code is the same as used above, interesting how it works now" [puppet] - https://gerrit.wikimedia.org/r/697943 (https://phabricator.wikimedia.org/T273716) (owner: Filippo Giunchedi) [23:14:54] (CR) Cwhite: [C: +1] "LGTM (untested)" [debs/karma] - https://gerrit.wikimedia.org/r/697916 (owner: Filippo Giunchedi) [23:15:33] (PS1) Dzahn: admin: replace SSH key for janstee [puppet] - https://gerrit.wikimedia.org/r/698071 (https://phabricator.wikimedia.org/T266249) [23:18:42] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn) @JAnstee_WMF Alright, I made a patch to replace your key and uploaded it to code 
review. https://gerrit.wikimedia.org/r/c/operations... [23:25:45] jouncebot: now [23:25:45] For the next 0 hour(s) and 34 minute(s): US Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210603T2300) [23:25:48] jouncebot: next [23:25:48] In 7 hour(s) and 34 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210604T0700) [23:26:02] (PS7) Reedy: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) [23:28:32] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2001.codfw.wmnet with reason: REIMAGE [23:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:50] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [23:30:40] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2001.codfw.wmnet with reason: REIMAGE [23:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:49] (CR) Reedy: [C: +2] Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) (owner: Reedy) [23:33:35] (Merged) jenkins-bot: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) (owner: Reedy) [23:33:35] !log installing OS on fresh VM doh5001 [23:33:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:13] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T280886 (duration: 00m 57s) [23:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:17] T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886 [23:41:33] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: T280886 (duration: 00m 56s) [23:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:41] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST