[00:00:04] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T0000).
[00:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:24] present
[00:01:54] Is this the backport window we tend to struggle to find someone for?
[00:04:06] brennen: dancy either of you around as I'm hoping to squash https://phabricator.wikimedia.org/T300746 / https://phabricator.wikimedia.org/T299971
[00:12:45] (PS2) Cwhite: idp, grafana: configure grafana-next-rw for sso [puppet] - https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863)
[00:13:04] Jdlrobson: hey, here
[00:13:16] brennen: hey are you able to help with the backport window?
[00:13:26] yeah
[00:13:30] one sec
[00:13:30] sweet. Thank you!
[00:14:21] Jdlrobson: start from the top of the list?
[00:14:49] brennen: yes please
[00:15:00] it's going to be 2 or 3 changes, depending on how well 2 goes
[00:15:13] i'll need to check the logs after syncing to see if it has the desired effect (errors disappearing from logs)
[00:15:31] (PS3) Cwhite: idp, grafana: configure grafana-next-rw for sso [puppet] - https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863)
[00:16:12] (CR) Cwhite: idp, grafana: configure grafana-next-rw for sso (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863) (owner: Cwhite)
[00:16:15] (CR) Brennen Bearnes: [C: +2] Changes the labels of the Vector skins [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/759308 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:16:48] (PS2) Cwhite: hiera: add grafana-next-rw to grafana public_aliases [puppet] - https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863)
[00:18:21] (PS3) Cwhite: hiera: add grafana-next and grafana-next-rw to grafana public_aliases [puppet] - https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863)
[00:18:43] (CR) Cwhite: hiera: add grafana-next and grafana-next-rw to grafana public_aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863) (owner: Cwhite)
[00:18:54] (PS1) Papaul: Add Papaul to cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/759364 (https://phabricator.wikimedia.org/T300660)
[00:19:26] Jdlrobson: is there a backport missing for https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/759343 ?
[00:20:15] (PS1) Jdlrobson: Pass skin name to Hooks::isSkinLegacy [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/759309 (https://phabricator.wikimedia.org/T299971)
[00:20:22] Sorry!
This one first ^
[00:20:55] (PS2) Papaul: Add Papaul to cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/759364 (https://phabricator.wikimedia.org/T300660)
[00:21:00] you can sync that one at the same time as the Vector change
[00:21:10] they don't depend on each other
[00:21:25] (CR) Brennen Bearnes: [C: +2] Pass skin name to Hooks::isSkinLegacy [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/759309 (https://phabricator.wikimedia.org/T299971) (owner: Jdlrobson)
[00:29:38] Jdlrobson: i just realized that this first one is an i18n/en.json change, which probably necessitates a sync-world, yeah?
[00:30:59] brennen: sadly yes
[00:31:17] (Merged) jenkins-bot: Changes the labels of the Vector skins [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/759308 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:32:09] (CR) Cwhite: hiera: set domainrw to grafana-next-rw in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757774 (https://phabricator.wikimedia.org/T282863) (owner: Cwhite)
[00:37:13] Jdlrobson: pulled that one to mwdebug, but i'm guessing there's a step here for testing the new i18n that i don't know about. i guess this ought to be safe to sync?
[00:37:14] (Merged) jenkins-bot: Pass skin name to Hooks::isSkinLegacy [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/759309 (https://phabricator.wikimedia.org/T299971) (owner: Jdlrobson)
[00:38:24] brennen: should be safe to sync yes
[00:40:48] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (Papaul)
[00:41:01] Jdlrobson: k, second one is on mwdebug1002 if there's anything to test there, otherwise i'll go ahead with sync-world for both.
[00:42:14] RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:38] brennen: you can sync that too
[00:42:41] then we wait for a bit
[00:43:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2029.codfw.wmnet with OS buster
[00:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:37] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2029.codfw.w...
[00:44:46] !log brennen@deploy1002 Started scap: Backports: [[gerrit:759308|Changes the labels of the Vector skins (T299927)]] and [[gerrit:759309|Pass skin name to Hooks::isSkinLegacy (T299971)]]
[00:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:52] T299927: Deploy new Vector skin to all projects - https://phabricator.wikimedia.org/T299927
[00:44:52] T299971: [subtask] Problem in Legacy Vector calculation ([{reqId}] {exception_url} PHP Notice: Undefined index: data-user-page ) - https://phabricator.wikimedia.org/T299971
[00:46:22] jouncebot: next
[00:46:22] In 0 hour(s) and 13 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T0100)
[00:46:44] twentyafterfour: fyi, i think this backport window is going to drag out a bit.
[00:47:05] brennen: are they synced now?
[00:47:30] in progress. this takes a while.
[00:47:34] brennen: ack
[00:48:37] (CR) Cwhite: initial sketch of watchrat alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[00:48:46] PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:58] we're at sync-apaches.
[00:58:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:05] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T0100).
[01:02:02] (still cookin' on this backport window, although i'm fuzzy now on whether we were doing an actual phab deploy tonight or next week.)
[01:04:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[01:04:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[01:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:59] brennen: we can try a deploy if you still are up for it after backport
[01:06:50] brennen: looks like sync completed?
[01:06:55] at least.. i'm seeing results
[01:07:12] scap-cdb-rebuild is still underway
[01:07:21] also looks like we should backport that change to wmf20 as I've seen no production errors in last 10 mins
[01:07:37] https://usercontent.irccloud-cdn.com/file/Yj22lQXM/Screen%20Shot%202022-02-02%20at%205.07.34%20PM.png
[01:07:52] (CR) Cwhite: [C: +1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/759297 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[01:08:00] we can do that tomorrow though if we are short on time
[01:08:13] since wmf20 is not due on French/PT until tomorrow.
[01:09:27] (PS1) Jdlrobson: Pass skin name to Hooks::isSkinLegacy [skins/Vector] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/759310 (https://phabricator.wikimedia.org/T299971)
[01:09:35] !log brennen@deploy1002 Finished scap: Backports: [[gerrit:759308|Changes the labels of the Vector skins (T299927)]] and [[gerrit:759309|Pass skin name to Hooks::isSkinLegacy (T299971)]] (duration: 24m 48s)
[01:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:40] T299927: Deploy new Vector skin to all projects - https://phabricator.wikimedia.org/T299927
[01:09:40] T299971: [subtask] Problem in Legacy Vector calculation ([{reqId}] {exception_url} PHP Notice: Undefined index: data-user-page ) - https://phabricator.wikimedia.org/T299971
[01:09:49] ok, sync actually finished
[01:09:51] Jdlrobson: if you're good with where we're at, i'd like to move on to the phab deploy.
[01:10:02] no problem. I'll do it tomorrow. Thanks a bunch brennen
[01:10:10] thanks - have a good one!
[01:10:32] twentyafterfour: still up for it if you are
[01:10:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[01:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:19] ok gimme about two minutes to finish my burrito
[01:11:39] twentyafterfour: no rush.
grabbing a tea, in the google meet
[01:11:44] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2029.codfw.wmnet with OS buster
[01:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:49] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2029.codfw.wmnet...
[01:12:13] !log UTC late backport window finished
[01:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:19] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (AntiCompositeNumber) Another report of user-facing impact in `#mediawiki` from someone using the w3m browser: https://wm-bot.wmcloud.org/logs/%23mediawiki/20220...
[02:31:12] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:38:37] if anyone cared, phab deploy didn't happen because we took too long going over the details about phabricator translations and whatnot.
will attempt phab deployment sometime tomorrow
[03:02:02] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[03:21:20] SRE, Traffic, envoy, serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) Hm, reprepro only has 1.15.4 in wikimedia-stretch, compared to 1.15.5 in buster and bullseye. I assume that's an oversight and not an intentional holdback, but so far I h...
[03:38:18] SRE, Traffic, WMF-Legal, Performance-Team (Radar), Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (dr0ptp4kt) Before we touch `no-transform` I'm requesting that @marayana and team mak...
[03:50:42] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[05:06:22] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1486.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:02] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, mov
[05:23:02]
://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[05:24:14] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 53.11 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:26:28] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 103 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:45:04] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:13:54] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1284.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:16:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[06:16:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[06:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:02] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298558)', diff saved to
https://phabricator.wikimedia.org/P19994 and previous config saved to /var/cache/conftool/dbconfig/20220203-061703-marostegui.json
[06:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:08] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[06:17:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:17:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:19:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:20:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:48] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:22:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[06:22:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[06:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T300402)', diff saved to https://phabricator.wikimedia.org/P19995 and previous config saved to /var/cache/conftool/dbconfig/20220203-062243-marostegui.json
[06:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:48] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[06:25:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298558)', diff saved to https://phabricator.wikimedia.org/P19996 and previous config saved to /var/cache/conftool/dbconfig/20220203-062556-marostegui.json
[06:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:01] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[06:26:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300402)', diff saved to https://phabricator.wikimedia.org/P19997 and previous config saved to /var/cache/conftool/dbconfig/20220203-062627-marostegui.json
[06:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:52] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.43 seconds
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:38:24] (PS1) Marostegui: add_tl_target_id_T300775.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/759379 (https://phabricator.wikimedia.org/T300775)
[06:40:31] (PS1) Marostegui: filtered_tables.txt: Add tl_target_id [puppet] - https://gerrit.wikimedia.org/r/759380 (https://phabricator.wikimedia.org/T300775)
[06:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19998 and previous config saved to /var/cache/conftool/dbconfig/20220203-064101-marostegui.json
[06:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19999 and previous config saved to /var/cache/conftool/dbconfig/20220203-064131-marostegui.json
[06:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P20000 and previous config saved to /var/cache/conftool/dbconfig/20220203-065606-marostegui.json
[06:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P20001 and previous config saved to /var/cache/conftool/dbconfig/20220203-065636-marostegui.json
[06:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:41] (PS1) Elukey: ml-services: add service account config for editquality's transformer [deployment-charts] - https://gerrit.wikimedia.org/r/759381
[07:04:00] PROBLEM - puppet last run on orespoolcounter2003 is CRITICAL: CRITICAL:
Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:04:10] PROBLEM - puppet last run on orespoolcounter1004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:05:10] PROBLEM - puppet last run on orespoolcounter1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:05:46] PROBLEM - puppet last run on orespoolcounter2004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:06:13] this is me, I disabled puppet and re-enabled only now --^
[07:06:35] running puppet now
[07:08:58] (PS1) Elukey: role::deployment_server::kubernetes: update check_disk monitor [puppet] - https://gerrit.wikimedia.org/r/759384
[07:09:46] (CR) Elukey: [C: +2] ml-services: add service account config for editquality's transformer [deployment-charts] - https://gerrit.wikimedia.org/r/759381 (owner: Elukey)
[07:10:24] RECOVERY - puppet last run on orespoolcounter2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:10:29] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33564/console" [puppet] - https://gerrit.wikimedia.org/r/759384 (owner: Elukey)
[07:10:34] RECOVERY - puppet last run on orespoolcounter1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:11:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298558)', diff saved to https://phabricator.wikimedia.org/P20002 and previous config saved to /var/cache/conftool/dbconfig/20220203-071111-marostegui.json
[07:11:14] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:16] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[07:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300402)', diff saved to https://phabricator.wikimedia.org/P20003 and previous config saved to /var/cache/conftool/dbconfig/20220203-071141-marostegui.json
[07:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:45] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[07:11:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[07:11:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[07:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance
[07:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance
[07:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:18] RECOVERY - puppet last run on orespoolcounter1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:12:20] RECOVERY - puppet last run on orespoolcounter2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:13:43] !log marostegui@cumin1001 START - Cookbook
sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:13:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T300402)', diff saved to https://phabricator.wikimedia.org/P20004 and previous config saved to /var/cache/conftool/dbconfig/20220203-071348-marostegui.json
[07:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[07:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:35] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[07:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[07:17:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300402)', diff saved to https://phabricator.wikimedia.org/P20005 and previous config saved to /var/cache/conftool/dbconfig/20220203-071732-marostegui.json
[07:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:37] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[07:18:57] (CR) Elukey: [V: +1 C: +2] "Since it is relatively easy to rollout/rollback I am going to merge and fix if needed later on :)" [puppet] - https://gerrit.wikimedia.org/r/759384 (owner: Elukey)
[07:21:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[07:23:07] (PS2) Marostegui: mariadb: Promote db1159 to m2 master [puppet] - https://gerrit.wikimedia.org/r/759222 (https://phabricator.wikimedia.org/T300329)
[07:23:30] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2078,2133].codfw.wmnet,db[1117,1159,1183].eqiad.wmnet with reason: Switchover m2 T300329
[07:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:34] T300329: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329
[07:23:34] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2078,2133].codfw.wmnet,db[1117,1159,1183].eqiad.wmnet with reason: Switchover m2 T300329
[07:23:37] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:40] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:28] RECOVERY - Disk space on deploy1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops
[07:26:29] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33565/console" [puppet] - https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) (owner: Btullis)
[07:29:04] (CR) Elukey: [V: +1 C: +1] "LGTM! Let's merge it so we can clear some alerts :)" [puppet] - https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) (owner: Btullis)
[07:29:39] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui)
[07:31:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[07:31:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[07:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[07:31:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[07:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:29] ACKNOWLEDGEMENT - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service Elukey T300062 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298558)', diff saved to https://phabricator.wikimedia.org/P20006 and previous config saved to /var/cache/conftool/dbconfig/20220203-073129-marostegui.json
[07:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:33] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[07:32:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P20007 and previous config saved to /var/cache/conftool/dbconfig/20220203-073237-marostegui.json
[07:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298558)', diff saved to https://phabricator.wikimedia.org/P20008 and previous config saved to /var/cache/conftool/dbconfig/20220203-073735-marostegui.json
[07:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:40] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[07:47:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P20009 and previous config saved to /var/cache/conftool/dbconfig/20220203-074742-marostegui.json
[07:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:12] SRE, Wikidata, Wikidata Query UI,
10wdwb-tech, 10Patch-For-Review: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10elukey) 05Resolved→03Open Hi folks, on miscweb1002 I see the following in puppet: ` Feb 3 07:40:34 miscweb1002 puppet-agent[31666]: (/Stage[main]/Profil... [07:49:26] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10Peachey88) [07:52:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20010 and previous config saved to /var/cache/conftool/dbconfig/20220203-075240-marostegui.json [07:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:14] RECOVERY - Apache HTTP on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:55:58] <_joe_> !log restarted php-fpm on wtp1029, segfaulting [07:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300402)', diff saved to https://phabricator.wikimedia.org/P20011 and previous config saved to /var/cache/conftool/dbconfig/20220203-080247-marostegui.json [08:02:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:02:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:52] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T300402)', diff saved to 
https://phabricator.wikimedia.org/P20012 and previous config saved to /var/cache/conftool/dbconfig/20220203-080254-marostegui.json [08:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:02] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-05-04 06:59:52 +0000 (expires in 89 days) https://phabricator.wikimedia.org/tag/toolforge/ [08:05:18] (03PS1) 10Majavah: P:acme_chief: set watchdog_sec default on cloud [puppet] - 10https://gerrit.wikimedia.org/r/759439 (https://phabricator.wikimedia.org/T292619) [08:06:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300402)', diff saved to https://phabricator.wikimedia.org/P20013 and previous config saved to /var/cache/conftool/dbconfig/20220203-080637-marostegui.json [08:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20014 and previous config saved to /var/cache/conftool/dbconfig/20220203-080745-marostegui.json [08:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:55] (03CR) 10Majavah: "PCC (no-op in prod, PCC fails in a WMCS acme-chief host due to missing dummy secrets in labs/private): https://puppet-compiler.wmflabs.org" [puppet] - 10https://gerrit.wikimedia.org/r/759439 (https://phabricator.wikimedia.org/T292619) (owner: 10Majavah) [08:10:45] !log restarting blazegraph on wdqs1013 (jvm stuck for 5hours) [08:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:32] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1159 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/759222 (https://phabricator.wikimedia.org/T300329) (owner: 10Marostegui) 
[08:13:11] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [08:18:58] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P20015 and previous config saved to /var/cache/conftool/dbconfig/20220203-082142-marostegui.json [08:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298558)', diff saved to https://phabricator.wikimedia.org/P20016 and previous config saved to /var/cache/conftool/dbconfig/20220203-082249-marostegui.json [08:22:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [08:22:54] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [08:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:22:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:03] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298558)', diff saved to https://phabricator.wikimedia.org/P20017 and previous config saved to /var/cache/conftool/dbconfig/20220203-082302-marostegui.json [08:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298558)', diff saved to https://phabricator.wikimedia.org/P20018 and previous config saved to /var/cache/conftool/dbconfig/20220203-082710-marostegui.json [08:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:37] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) All pre-failover steps are done [08:27:53] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [08:29:30] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [08:32:18] (03PS1) 10Elukey: custom_deploy.d: improve ml-serve's istio configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/759441 [08:33:25] (03PS2) 10Elukey: custom_deploy.d: improve ml-serve's istio configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/759441 [08:36:16] (03PS1) 10Majavah: openstack: fix up check_flavor_properties [puppet] - 10https://gerrit.wikimedia.org/r/759443 [08:36:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P20019 and previous config saved to /var/cache/conftool/dbconfig/20220203-083647-marostegui.json [08:36:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:33] (03PS3) 10Elukey: Improve ml-serve's istio configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/759441 [08:42:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20020 and previous config saved to /var/cache/conftool/dbconfig/20220203-084215-marostegui.json [08:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:51] (03CR) 10Elukey: [C: 03+2] Improve ml-serve's istio configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/759441 (owner: 10Elukey) [08:51:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300402)', diff saved to https://phabricator.wikimedia.org/P20021 and previous config saved to /var/cache/conftool/dbconfig/20220203-085151-marostegui.json [08:51:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:51:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:57] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T300402)', diff saved to https://phabricator.wikimedia.org/P20022 and previous config saved to /var/cache/conftool/dbconfig/20220203-085159-marostegui.json [08:52:00] (03CR) 10Muehlenhoff: "Created the group and added it to https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups" [puppet] - 10https://gerrit.wikimedia.org/r/759264 (owner: 10Muehlenhoff) [08:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:44] (03PS8) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [08:54:31] (03PS1) 10Muehlenhoff: Add cn=idptest-users to groups affected by offboarding [puppet] - 10https://gerrit.wikimedia.org/r/759446 [08:55:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300402)', diff saved to https://phabricator.wikimedia.org/P20023 and previous config saved to /var/cache/conftool/dbconfig/20220203-085545-marostegui.json [08:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20024 and previous config saved to /var/cache/conftool/dbconfig/20220203-085720-marostegui.json [08:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:02] !log Failover m2 from db1183 to db1159 - T300329 [09:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:06] T300329: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 [09:00:48] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:04] It should all be done by now [09:01:07] I am checking services [09:01:50] * akosiaris checking ticket.wikimedia.org and mwaddlink [09:02:06] orch has the right topology [09:02:30] does someone have otrs loging to test it? 
[09:02:34] ticket.wikimedia.org is ok [09:02:38] akosiaris: <3 [09:02:57] (03CR) 10Muehlenhoff: [C: 03+2] Add cn=idptest-users to groups affected by offboarding [puppet] - 10https://gerrit.wikimedia.org/r/759446 (owner: 10Muehlenhoff) [09:03:15] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [09:03:22] jynus: fixed also the orchestrator lag [09:03:39] debmonitor looks good [09:03:52] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [09:03:53] I would like to know how you did it later (table cleanup?) [09:04:25] jynus: yep, I will let you know it is done [09:04:54] I think everything seems to be looking good [09:04:57] I think the largest other thing to check is the Link Recommendation Service [09:05:07] linkrecommendation looks ok [09:05:08] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [09:05:13] akosiaris: how did you test it? 
[09:05:14] cool then [09:05:14] (03PS2) 10Muehlenhoff: Add Cumin alias for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/759218 [09:05:18] nothing in the logs of all pods [09:05:23] sweet [09:05:31] marostegui: kubectl logs -l release=internal -c linkrecommendation-internal |grep '2022-02-03T09:' |grep -v 200, [09:05:36] but logstash would do it [09:05:41] I don't see anything connected on the old master either [09:05:42] actually it would be better ;-) [09:05:45] akosiaris: noted, thanks :) [09:05:52] I just am a cli person I guess [09:05:54] docs say "recommendationapi: Normally requires a restart on scb" that may need doc update [09:06:11] look at recommendationapi now [09:06:16] looking* [09:06:18] jynus: so to clean up the "lag" on orchestrator, you need to go to the new master and do a delete from heartbeat where server_id=OLDMASTER [09:06:31] I see [09:06:32] so that heartbeat bit is gone and only one heartbeat entry is there [09:07:11] nothing in the recommendation-api logs either [09:07:17] \o/ [09:07:34] jynus: yeah, remove that doc entry. It seems to be failing over ok now [09:07:38] I say those are the larger left because the others (debmonitor) are more internal [09:07:50] So we can conclude it is all good now? [09:07:55] +1 for me [09:08:24] akosiaris: will send a diff and you review it, ok? [09:09:15] for orchestrator, my ask is to have the exact query done on wikitech, or the query to clean up "if lag is shown but it is not real" [09:09:28] maybe it is on the switchover doc [09:09:46] yep, will add it ( delete from heartbeat where server_id=171970778;) [09:09:56] jynus: sure [09:09:59] if it is elsewhere, just a link would be ok [09:10:11] I will amend /misc now [09:10:17] thank you both for the support! [09:10:37] jynus: do you have the wikitech link handy? 
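Editor's sketch (not part of the channel log): the heartbeat cleanup described above, "delete from heartbeat where server_id=OLDMASTER" so that only one heartbeat entry remains, can be illustrated roughly as follows. This is a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for the MariaDB master, the two-column table layout is a simplification, NEW_MASTER_ID and the timestamps are made-up placeholders, and the helper name remove_stale_heartbeat is hypothetical. Only the table name, the server_id column and the id 171970778 come from the log.

```python
import sqlite3

OLD_MASTER_ID = 171970778   # server_id quoted in the log for the old master
NEW_MASTER_ID = 123456789   # hypothetical placeholder, not from the log


def remove_stale_heartbeat(conn, old_master_server_id):
    """Delete the old master's heartbeat row and return how many rows remain."""
    conn.execute(
        "DELETE FROM heartbeat WHERE server_id = ?",
        (old_master_server_id,),
    )
    conn.commit()
    (remaining,) = conn.execute("SELECT COUNT(*) FROM heartbeat").fetchone()
    return remaining


# In-memory stand-in for the heartbeat table on the new master (simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat (server_id INTEGER PRIMARY KEY, ts TEXT)")
conn.executemany(
    "INSERT INTO heartbeat (server_id, ts) VALUES (?, ?)",
    [
        (OLD_MASTER_ID, "2022-02-03T08:59:59"),  # stale row left by the old master
        (NEW_MASTER_ID, "2022-02-03T09:06:00"),  # row kept fresh by the new master
    ],
)

remaining = remove_stale_heartbeat(conn, OLD_MASTER_ID)
remaining_ids = [row[0] for row in conn.execute("SELECT server_id FROM heartbeat")]
print(remaining, remaining_ids)
```

Passing the server id as a bound parameter rather than formatting it into the SQL string is a design choice of this sketch; the log itself only shows the literal one-off statement run by hand on the new master.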
[09:10:49] one sec so I finish the edit :-) [09:10:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P20025 and previous config saved to /var/cache/conftool/dbconfig/20220203-091050-marostegui.json [09:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:53] it may need more revisions [09:10:54] jynus: yep! [09:11:00] or you mean the orchestator one? [09:11:27] on my side I don't have anything else to check, everything seems to work ok [09:11:31] (03CR) 10Muehlenhoff: "This looks fine, but I think we should set this in general, not just for the mail servers. While we've only seen this with the MXs for now" [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) (owner: 10JHathaway) [09:12:01] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/759218 (owner: 10Muehlenhoff) [09:12:13] akosiaris, marostegui: https://wikitech.wikimedia.org/w/index.php?title=MariaDB/misc&diff=1946310&oldid=1907969 [09:12:23] jynus: I will add the query to the switchover docs [09:12:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298558)', diff saved to https://phabricator.wikimedia.org/P20026 and previous config saved to /var/cache/conftool/dbconfig/20220203-091224-marostegui.json [09:12:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [09:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [09:12:29] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [09:12:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [09:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [09:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298558)', diff saved to https://phabricator.wikimedia.org/P20027 and previous config saved to /var/cache/conftool/dbconfig/20220203-091237-marostegui.json [09:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:46] akosiaris: note people there is "people that in the past helped monitoring", not owners :-) [09:13:35] marostegui: my first search was looking at: https://wikitech.wikimedia.org/wiki/Orchestrator#Troubleshooting [09:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298558)', diff saved to https://phabricator.wikimedia.org/P20028 and previous config saved to /var/cache/conftool/dbconfig/20220203-091345-marostegui.json [09:13:47] but it may be somewhere else already [09:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:13] as technically it is not an orch issue, but a db issue [09:14:42] thanks! 
[09:14:57] (and obviously I am not complaining, just it is a good opportunity to update docs if outdated) :-) [09:16:34] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [09:17:06] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Vgutierrez) 05Open→03In progress p:05Triage→03Medium [09:17:12] marostegui: if you add something or find it somewhere, I can add a link to it on the misc/m*/heartbeat docs too, so it is another way to find it, as now it says "nothing to do" :-) [09:17:45] jynus: I think I will add it to the switchover part [09:17:51] +1 [09:17:57] I will just add a link :-) [09:18:07] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) 05Open→03Resolved [09:18:28] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) All done, thanks a lot @akosiaris @jcrespo for the support! 
[09:18:52] so I guess we are done with the immediate steps (even if I've done nothing :-D) [09:19:01] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10ayounsi) [09:22:44] jynus: https://wikitech.wikimedia.org/w/index.php?title=MariaDB&type=revision&diff=1946319&oldid=1938207 [09:23:14] thank you a lot, will link to that from a few places :-) [09:23:20] (03PS1) 10Vgutierrez: envoyproxy:tls_terminator: Accept HTTP/1.0 for SNI traffic [puppet] - 10https://gerrit.wikimedia.org/r/759448 (https://phabricator.wikimedia.org/T300366) [09:25:06] (03PS1) 10Marostegui: db1183: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/759449 (https://phabricator.wikimedia.org/T300243) [09:25:08] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33567/console" [puppet] - 10https://gerrit.wikimedia.org/r/759448 (https://phabricator.wikimedia.org/T300366) (owner: 10Vgutierrez) [09:25:54] (03CR) 10Marostegui: [C: 03+2] db1183: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/759449 (https://phabricator.wikimedia.org/T300243) (owner: 10Marostegui) [09:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P20029 and previous config saved to /var/cache/conftool/dbconfig/20220203-092554-marostegui.json [09:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20030 and previous config saved to /var/cache/conftool/dbconfig/20220203-092850-marostegui.json [09:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:21] (03PS1) 10Marostegui: mariadb: Move db1183 from m2 to m3 [puppet] - 
10https://gerrit.wikimedia.org/r/759450 (https://phabricator.wikimedia.org/T300835) [09:29:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy:tls_terminator: Accept HTTP/1.0 for SNI traffic [puppet] - 10https://gerrit.wikimedia.org/r/759448 (https://phabricator.wikimedia.org/T300366) (owner: 10Vgutierrez) [09:30:26] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 ayounsi https://phabricator.wikimedia.org/T300838 - The acknowledgement expires at: 2022-02-19 09:30:13. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:30:26] ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP ayounsi https://phabricator.wikimedia.org/T300838 - The acknowledgement expires at: 2022-02-19 09:30:13. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:30:42] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1183 from m2 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/759450 (https://phabricator.wikimedia.org/T300835) (owner: 10Marostegui) [09:30:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] envoyproxy:tls_terminator: Accept HTTP/1.0 for SNI traffic [puppet] - 10https://gerrit.wikimedia.org/r/759448 (https://phabricator.wikimedia.org/T300366) (owner: 10Vgutierrez) [09:31:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1183.eqiad.wmnet with OS bullseye [09:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:30] marostegui: https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2Fmisc&type=revision&diff=1946325&oldid=1946310 [09:31:34] (03CR) 10Matthias Mullie: [C: 03+1] [WikibaseMediaInfo] Stop normalizing full text scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759240 (https://phabricator.wikimedia.org/T296631) (owner: 10Matthias Mullie) [09:31:44] I think it is much clear than what was written before [09:31:51] *clearer [09:32:09] \o/ [09:32:19] as in some it said "nothing required" [09:32:23] and it was a bit 
misleading [09:32:38] yeah [09:32:44] it is "fake" lag, but it needs to be cleaned up [09:32:52] otherwise orchestrator looks messy [09:33:04] yeah, but the less confusion and outdated docs, the better :-) [09:33:28] (03CR) 10Ayounsi: [C: 03+1] O:rpkivalidator: add bgpalerter to rpki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [09:33:29] Will add a link on the Orchestrator page too, e.g. if someone arrives there without context [09:33:36] sounds good, thanks [09:34:00] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10fgiunchedi) I've chatted with @ayounsi and now we have T300836 to track more effective tasks for inbound interface errors [09:38:23] https://wikitech.wikimedia.org/w/index.php?title=Orchestrator&type=revision&diff=1946327&oldid=1931074 [09:39:20] jynus: excellent [09:39:25] thanks [09:40:45] (03PS6) 10Cathal Mooney: Add eBGP peering between CR routers and datacenter switches. 
[homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) [09:41:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300402)', diff saved to https://phabricator.wikimedia.org/P20031 and previous config saved to /var/cache/conftool/dbconfig/20220203-094059-marostegui.json [09:41:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [09:41:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [09:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:04] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [09:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T300402)', diff saved to https://phabricator.wikimedia.org/P20032 and previous config saved to /var/cache/conftool/dbconfig/20220203-094107-marostegui.json [09:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:40] (03CR) 10jerkins-bot: [V: 04-1] Add eBGP peering between CR routers and datacenter switches. [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [09:43:08] (03PS1) 10Kevin Bazira: ml-services: add model STORAGE_URI to enwiki-articlequality transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/759456 (https://phabricator.wikimedia.org/T294141) [09:43:30] (03PS7) 10Cathal Mooney: Add eBGP peering between CR routers and datacenter switches. 
[homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) [09:43:35] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) >>! In T300324#7673609, @RLazarus wrote: > Hm, reprepro has only has 1.15.4 in wikimedia-stretch, compared to 1.15.5 in buster and bullseye. I assume that's an oversight and... [09:43:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20033 and previous config saved to /var/cache/conftool/dbconfig/20220203-094354-marostegui.json [09:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300402)', diff saved to https://phabricator.wikimedia.org/P20034 and previous config saved to /var/cache/conftool/dbconfig/20220203-094447-marostegui.json [09:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:02] (03CR) 10Elukey: ml-services: add model STORAGE_URI to enwiki-articlequality transformer (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/759456 (https://phabricator.wikimedia.org/T294141) (owner: 10Kevin Bazira) [09:45:30] (03CR) 10Elukey: ml-services: add model STORAGE_URI to enwiki-articlequality transformer (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/759456 (https://phabricator.wikimedia.org/T294141) (owner: 10Kevin Bazira) [09:48:32] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [09:49:31] (03CR) 10Muehlenhoff: [C: 03+2] idp-test/puppetboard: Grant access to cn=idptest-users [puppet] - 10https://gerrit.wikimedia.org/r/759264 (owner: 10Muehlenhoff) [09:49:34] 10SRE, 10Traffic, 10Patch-For-Review: Problem loading thumbnail 
images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez Thanks for reporting this issue, this unveiled a bug in our envoyproxy... [09:52:43] (03PS2) 10Kevin Bazira: ml-services: add model STORAGE_URI to enwiki-articlequality transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/759456 (https://phabricator.wikimedia.org/T294141) [09:55:36] (03CR) 10Kevin Bazira: ml-services: add model STORAGE_URI to enwiki-articlequality transformer (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/759456 (https://phabricator.wikimedia.org/T294141) (owner: 10Kevin Bazira) [09:56:23] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: set domainrw to grafana-next-rw in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757774 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [09:56:28] (03PS4) 10Kormat: Use module-level loggers. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 [09:56:39] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add grafana-next and grafana-next-rw to grafana public_aliases [puppet] - 10https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [09:57:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1183.eqiad.wmnet with OS bullseye [09:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298558)', diff saved to https://phabricator.wikimedia.org/P20036 and previous config saved to /var/cache/conftool/dbconfig/20220203-095859-marostegui.json [09:59:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:03] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:59:03] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [09:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298558)', diff saved to https://phabricator.wikimedia.org/P20037 and previous config saved to /var/cache/conftool/dbconfig/20220203-095907-marostegui.json [09:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:39] (03CR) 10Kormat: [C: 03+2] Use module-level loggers. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 (owner: 10Kormat) [09:59:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P20038 and previous config saved to /var/cache/conftool/dbconfig/20220203-095952-marostegui.json [09:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:03] (03Merged) 10jenkins-bot: Use module-level loggers. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 (owner: 10Kormat) [10:02:54] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:03:37] FYI I am pooling the 5 new AQS servers in the cassandra 3 aqs_next cluster. 
[10:03:40] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:04:03] ^expected [10:06:04] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs1010.eqiad.wmnet [10:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:15] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs1011.eqiad.wmnet [10:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:21] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs1012.eqiad.wmnet [10:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:26] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs1013.eqiad.wmnet [10:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:34] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs1014.eqiad.wmnet [10:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:41] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs1015.eqiad.wmnet [10:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:57] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1010.eqiad.wmnet [10:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:08] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1012.eqiad.wmnet [10:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:11] (03PS9) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [10:07:16] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=aqs1013.eqiad.wmnet [10:07:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298558)', diff saved to https://phabricator.wikimedia.org/P20039 and previous config saved to /var/cache/conftool/dbconfig/20220203-100716-marostegui.json [10:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:22] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1014.eqiad.wmnet [10:07:22] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [10:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:27] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1015.eqiad.wmnet [10:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:04] (03CR) 10Elukey: [C: 03+2] ml-services: add model STORAGE_URI to enwiki-articlequality transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/759456 (https://phabricator.wikimedia.org/T294141) (owner: 10Kevin Bazira) [10:12:30] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:12:54] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:14:48] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:14:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P20040 and previous config saved to /var/cache/conftool/dbconfig/20220203-101456-marostegui.json [10:15:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:14] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:15:20] (03CR) 10Btullis: [C: 03+2] Fold-in (minor) upstream configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/757998 (https://phabricator.wikimedia.org/T298516) (owner: 10Eevans) [10:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20041 and previous config saved to /var/cache/conftool/dbconfig/20220203-102221-marostegui.json [10:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300402)', diff saved to https://phabricator.wikimedia.org/P20042 and previous config saved to /var/cache/conftool/dbconfig/20220203-103001-marostegui.json [10:30:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:30:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:06] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [10:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T300402)', diff saved to https://phabricator.wikimedia.org/P20043 and previous config saved to /var/cache/conftool/dbconfig/20220203-103008-marostegui.json [10:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:55] PROBLEM - haproxy failover on dbproxy1015 
is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:31:18] ACKNOWLEDGEMENT - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:18] ACKNOWLEDGEMENT - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:18] ACKNOWLEDGEMENT - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:18] ACKNOWLEDGEMENT - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:18] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300402)', diff saved to https://phabricator.wikimedia.org/P20044 and previous config saved to /var/cache/conftool/dbconfig/20220203-103354-marostegui.json [10:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:59] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:37:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20045 and previous config saved to /var/cache/conftool/dbconfig/20220203-103725-marostegui.json [10:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:57] (03CR) 10Ayounsi: [C: 03+1] "1 non-blocker comment, but LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [10:48:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20046 and previous config saved to /var/cache/conftool/dbconfig/20220203-104858-marostegui.json [10:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:04] (03PS1) 10Cathal Mooney: Added IP ranges for new subnets Eqiad expansion cage E/F [puppet] - 10https://gerrit.wikimedia.org/r/759467 (https://phabricator.wikimedia.org/T299758) [10:51:04] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10jcrespo) > Has anyone from the organization ever accessed our Bing Webmaster Tools before? Maybe this is completely new for us? Based on Faidon's reponse, the (lack of) documentati... 
[10:51:44] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10jcrespo) a:03jcrespo [10:52:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298558)', diff saved to https://phabricator.wikimedia.org/P20047 and previous config saved to /var/cache/conftool/dbconfig/20220203-105230-marostegui.json [10:52:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:52:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:35] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [10:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:38] (03CR) 10Ayounsi: [C: 03+1] Added IP ranges for new subnets Eqiad expansion cage E/F [puppet] - 10https://gerrit.wikimedia.org/r/759467 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [10:52:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298558)', diff saved to https://phabricator.wikimedia.org/P20048 and previous config saved to /var/cache/conftool/dbconfig/20220203-105238-marostegui.json [10:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298558)', diff saved to https://phabricator.wikimedia.org/P20049 and previous config saved to /var/cache/conftool/dbconfig/20220203-105345-marostegui.json [10:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] mvolz: 
#bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T1100). [11:00:31] (03PS2) 10Matthias Mullie: Stop capturing media change tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756014 (https://phabricator.wikimedia.org/T286362) [11:02:51] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:04:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20050 and previous config saved to /var/cache/conftool/dbconfig/20220203-110403-marostegui.json [11:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:33] (03CR) 10Cathal Mooney: [C: 03+2] Added IP ranges for new subnets Eqiad expansion cage E/F [puppet] - 10https://gerrit.wikimedia.org/r/759467 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:06:29] (03PS1) 10Vgutierrez: varnish: match listen_depth and net.core.soxmaxconn [puppet] - 10https://gerrit.wikimedia.org/r/759468 (https://phabricator.wikimedia.org/T290005) [11:07:59] (03CR) 10Aklapper: "Is there any way to proceed here? I prefer not to have my +1ed patches opened forever. :) Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: 10Aklapper) [11:08:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20051 and previous config saved to /var/cache/conftool/dbconfig/20220203-110850-marostegui.json [11:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:38] (03CR) 10Majavah: Phabricator: Uninstall Conpherence application also in default settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: 10Aklapper) [11:10:39] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:10:45] (03PS8) 10Cathal Mooney: Add eBGP peering between CR routers and datacenter switches. [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) [11:10:47] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:16] (03PS1) 104nn1l2: commonswiki: Add www.gbols.smns-bw.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759469 (https://phabricator.wikimedia.org/T300842) [11:11:44] (03CR) 10Cathal Mooney: [C: 03+2] Add eBGP peering between CR routers and datacenter switches. 
(031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:11:57] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:12:28] (03Merged) 10jenkins-bot: Add eBGP peering between CR routers and datacenter switches. [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:13:11] (03PS2) 10JMeybohm: Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) [11:13:13] (03PS2) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [11:13:15] (03PS1) 10JMeybohm: Create a new wikimedia_cluster: kubernetes-staging [puppet] - 10https://gerrit.wikimedia.org/r/759470 (https://phabricator.wikimedia.org/T273866) [11:13:58] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [11:14:20] (03PS2) 10JMeybohm: Create a new wikimedia_cluster: kubernetes-staging [puppet] - 10https://gerrit.wikimedia.org/r/759470 (https://phabricator.wikimedia.org/T273866) [11:14:22] (03PS3) 10JMeybohm: Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) [11:14:24] (03PS3) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [11:14:39] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve 
extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:15:17] (03PS6) 10Majavah: mediawiki: Redirect Special:CodeReview to static archives [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) [11:15:22] (03PS4) 10JMeybohm: Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) [11:15:24] !log Adding BGP peering to lsw1-f1-eqiad on cr2-eqiad. T299758. [11:15:25] (03CR) 10jerkins-bot: [V: 04-1] Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [11:15:26] (03PS4) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [11:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:28] T299758: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 [11:15:57] (03PS1) 10Marostegui: db2134: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/759472 (https://phabricator.wikimedia.org/T300835) [11:16:08] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Redirect Special:CodeReview to static archives [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [11:16:26] (03CR) 10Ema: [C: 03+1] varnish: match listen_depth and net.core.soxmaxconn [puppet] - 10https://gerrit.wikimedia.org/r/759468 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:17:28] (03PS7) 10Majavah: mediawiki: Redirect Special:CodeReview to static archives [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) [11:17:38] (03CR) 10Marostegui: [C: 03+2] db2134: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/759472 (https://phabricator.wikimedia.org/T300835) (owner: 10Marostegui) [11:17:48] (03CR) 10Filippo Giunchedi: initial sketch of watchrat alert (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [11:17:55] (03CR) 10Majavah: mediawiki: Redirect Special:CodeReview to static archives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [11:18:17] (03CR) 10Vgutierrez: [C: 03+2] varnish: match listen_depth and net.core.soxmaxconn [puppet] - 10https://gerrit.wikimedia.org/r/759468 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:19:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300402)', diff saved to https://phabricator.wikimedia.org/P20052 and previous config saved to /var/cache/conftool/dbconfig/20220203-111908-marostegui.json [11:19:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:19:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:19:13] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:19] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T300402)', diff saved to https://phabricator.wikimedia.org/P20053 and previous config saved to /var/cache/conftool/dbconfig/20220203-111921-marostegui.json [11:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:45] (03CR) 10Filippo Giunchedi: "LGTM, modulo a nit" [puppet] - 10https://gerrit.wikimedia.org/r/759297 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [11:22:36] (03PS2) 10Btullis: Use the system default mysql prometheus exporter for analytics-meta and matomo [puppet] - 10https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) [11:23:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300402)', diff saved to https://phabricator.wikimedia.org/P20054 and previous config saved to /var/cache/conftool/dbconfig/20220203-112311-marostegui.json [11:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:41] (03CR) 10Filippo Giunchedi: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:23:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20055 and previous config saved to /var/cache/conftool/dbconfig/20220203-112355-marostegui.json [11:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759470 (https://phabricator.wikimedia.org/T273866) (owner: 10JMeybohm) [11:26:02] !log rolling varnish-fe restart to catch the new listen_depth config value [11:26:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [11:28:27] (03CR) 10Filippo Giunchedi: [C: 03+1] Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [11:28:54] 10SRE, 10SRE-Access-Requests: saisuman ssh keys have been uploaded to WMCS - https://phabricator.wikimedia.org/T300708 (10jcrespo) Hey, @SCherukuwada, just to clarify the original post- this is a relatively common mistake. The keys have been revoked out of precaution, but not your access (access only has been... [11:30:53] (03PS1) 10Cathal Mooney: Removed local-as statement on eBGP peering from CR to SPINE [homer/public] - 10https://gerrit.wikimedia.org/r/759473 (https://phabricator.wikimedia.org/T299758) [11:31:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [11:31:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "I've been navigating http://downloads.linux.hpe.com/SDR/repo/mcp/debian/ by hand, and this patch seems correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/758050 (https://phabricator.wikimedia.org/T300438) (owner: 10Majavah) [11:31:44] (03CR) 10Cathal Mooney: [C: 03+2] Removed local-as statement on eBGP peering from CR to SPINE [homer/public] - 10https://gerrit.wikimedia.org/r/759473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:32:01] (03CR) 10Btullis: Use the system default mysql prometheus exporter for analytics-meta and matomo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [11:32:11] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Removed local-as statement on eBGP peering from CR to SPINE [homer/public] - 10https://gerrit.wikimedia.org/r/759473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:32:19] (03Merged) 10jenkins-bot: Removed local-as statement on eBGP peering from CR to SPINE [homer/public] - 10https://gerrit.wikimedia.org/r/759473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:32:21] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33568/console" [puppet] - 10https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [11:33:01] !log draining ganeti1020 for eventual reimage [11:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:06] (03CR) 10Jcrespo: "Aklapper: hit me on irc (jynus) at #wikimedia-operations for deploying now." 
[puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: 10Aklapper) [11:33:27] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1428.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:34:01] (03CR) 10Aklapper: "(Furthermore, I have no idea what the line `source /etc/phab_mfa_check.conf` is good for)" [puppet] - 10https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [11:34:07] (03PS1) 10Vgutierrez: cumin: Add cache::text_envoy to cp-text alias [puppet] - 10https://gerrit.wikimedia.org/r/759474 (https://phabricator.wikimedia.org/T271421) [11:34:49] is andre not around? [11:35:10] (03PS1) 10KartikMistry: Update cxserver to 2022-02-03-112745-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/759475 [11:35:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759474 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:35:38] thx moritzm [11:35:59] you're faster than jerkins :D [11:36:13] !log reprepro changes @ apt1001 after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/758050 [11:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:20] (03CR) 10Vgutierrez: [C: 03+2] cumin: Add cache::text_envoy to cp-text alias [puppet] - 10https://gerrit.wikimedia.org/r/759474 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:36:37] *jenkins [11:37:02] and less pedantic :-) [11:37:34] mvolz: Is it safe to deploy an important cxserver change right now? (Because the citoid deployment window is on..) 
[11:38:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P20056 and previous config saved to /var/cache/conftool/dbconfig/20220203-113815-marostegui.json [11:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298558)', diff saved to https://phabricator.wikimedia.org/P20057 and previous config saved to /var/cache/conftool/dbconfig/20220203-113859-marostegui.json [11:39:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [11:39:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [11:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:04] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [11:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298558)', diff saved to https://phabricator.wikimedia.org/P20058 and previous config saved to /var/cache/conftool/dbconfig/20220203-113907-marostegui.json [11:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298558)', diff saved to https://phabricator.wikimedia.org/P20059 and previous config saved to /var/cache/conftool/dbconfig/20220203-114015-marostegui.json [11:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:30] Looks like nothing deployed during the window.. So going ahead. 
[11:40:48] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-02-03-112745-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/759475 (owner: 10KartikMistry) [11:42:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [11:44:35] (03Merged) 10jenkins-bot: Update cxserver to 2022-02-03-112745-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/759475 (owner: 10KartikMistry) [11:45:27] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:45:31] !log installing openjdk-11 security updates [11:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:56] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply on staging [11:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:58] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply on production [11:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:33] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: sync on staging [11:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:23] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply on production [11:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:27] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply on staging [11:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:23] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: sync on production [11:51:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:31] 10SRE: Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10MoritzMuehlenhoff) [11:52:15] jynus: Hey there! Re https://gerrit.wikimedia.org/r/c/operations/puppet/+/542787 (though I'm clueless what I'm supposed to do or help with tbh) [11:52:45] actually, not much: just being around, making sure the patch works as intended and helping in case of a fire [11:53:08] you are the phab expert of the 2 :-) [11:53:19] (03PS3) 10Jcrespo: Phabricator: Uninstall Conpherence application also in default settings [puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: 10Aklapper) [11:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P20060 and previous config saved to /var/cache/conftool/dbconfig/20220203-115320-marostegui.json [11:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:54] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply on production [11:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:57] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply on staging [11:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:07] andre: I am on clinic duty today; for anything where it is not super clear who will own it, ask the person in the topic [11:54:12] makes sense [11:54:32] and it will be handled by him/her, or he/she will redirect you to the right person [11:54:53] that app has been disabled anyway for two years; no signs of it in the Phab UI config anywhere, I'd just want stuff to be in sync... 
[11:55:11] that's ok, the problem is that no one CCed but me could deploy, I think [11:55:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20061 and previous config saved to /var/cache/conftool/dbconfig/20220203-115519-marostegui.json [11:55:21] so SREs were not even aware that patch was pending, probably [11:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:42] (03CR) 10Jcrespo: [C: 03+2] Phabricator: Uninstall Conpherence application also in default settings [puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: 10Aklapper) [11:55:42] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: sync on production [11:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:53] ^the above is your patch [11:55:58] about to merge now [11:56:13] yay [11:56:32] merged, but not deployed yet [11:56:37] now running puppet on phab server [11:57:06] !log ulsfo: push Capirca generated border-in filters [11:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:11] !log Updated cxserver to 2022-02-03-112745-production, this should unbreak Flores MT! [11:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:02] andre: Notice: /Stage[main]/Phabricator/File[/srv/phab/phabricator/conf/local/local.json]/content: content changed '{md5}7535d3577c48b3538d3881e28a447cba' to '{md5}e260c83b38afd9edd6cc9c043314d3af' [11:58:17] I didn't see any automatic daemon refresh [11:58:33] do you know if the config change needs a refresh on the service? [11:59:06] or can you check the change applied somehow? [11:59:13] as in, the live service [12:00:04] Amir1, Lucas_WMDE, and apergos: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T1200). [12:00:05] matthiasmullie and nn1l2: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:08] o/ [12:00:08] hi [12:00:14] there are no trainees signed up today [12:00:27] and I think A*mir is out also [12:00:28] there are two config patches scheduled in the window and I see both patch owners are here [12:00:36] hey [12:00:40] jynus: that's the irony: there has been no sign anymore anyway of the Conpherence app in Phab since https://phabricator.wikimedia.org/T127640 so I wouldn't even know what to check [12:00:45] (03PS1) 104nn1l2: mniwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759479 (https://phabricator.wikimedia.org/T294709) [12:00:50] both look reasonable at first glance [12:00:57] I can deploy unless someone else wants to? [12:01:07] do either of you need someone to deploy your patches or are you both self-service? [12:01:09] jynus: I'd assume no refresh is needed [12:01:17] andre: ok, then that's all [12:01:20] thanks! [12:01:24] o/ [12:01:28] in 99% of cases nothing bad happens [12:01:30] matthiasmullie: nn1l2 [12:01:34] it is just being around for the other 1% [12:01:36] :-) [12:02:05] andre: also feel free to ping me for any pending deployment for today [12:02:05] I have been pinged. why? [12:02:12] (02:01:07 PM) apergos: do either of you need someone to deploy your patches or are you both self-service? 
[12:02:20] happy to self-service [12:02:45] (03PS3) 10Aklapper: Phabricator: add override for the browser time zone conflict message [puppet] - 10https://gerrit.wikimedia.org/r/718418 (https://phabricator.wikimedia.org/T158177) (owner: 10DannyS712) [12:02:51] ok, matthiasmullie, as you are first in the list, go ahead (unless nn1l2 you think your testing will take longer or there is another reason to go first) [12:03:05] I can wait [12:03:08] ok! [12:03:11] I am in no hurry at all [12:03:20] I prefer to be the last [12:03:29] I'm working on another patch [12:03:36] and nn1l2 will you be merging/deploying yourself, or do you need an assist? [12:03:42] no [12:03:42] alright I'll go [12:03:52] I need a deployer [12:03:54] (03PS1) 10Arturo Borrero Gonzalez: base: standard_packages: don't install hp-health on Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) [12:04:01] (03PS2) 10Aklapper: Fix broken rendering of characters in EasyTimeline for Yue Chinese [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) [12:04:10] (03PS2) 10Matthias Mullie: [WikibaseMediaInfo] Stop normalizing full text scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759240 (https://phabricator.wikimedia.org/T296631) [12:04:15] (03CR) 10Matthias Mullie: [C: 03+2] [WikibaseMediaInfo] Stop normalizing full text scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759240 (https://phabricator.wikimedia.org/T296631) (owner: 10Matthias Mullie) [12:04:22] okey dokey, we'll take care of it when it's your turn. 
[12:04:33] 10SRE, 10SRE-Access-Requests: saisuman ssh keys have been uploaded to WMCS - https://phabricator.wikimedia.org/T300708 (10jcrespo) p:05Triage→03Medium [12:04:47] (03CR) 10jerkins-bot: [V: 04-1] Fix broken rendering of characters in EasyTimeline for Yue Chinese [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) (owner: 10Aklapper) [12:04:48] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10jcrespo) p:05Triage→03High [12:05:18] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) So after more rounds of tests: * Special:UploadWizard still works just fine, given it's an async process it's completely unfazed by the transition to... [12:05:36] (03Merged) 10jenkins-bot: [WikibaseMediaInfo] Stop normalizing full text scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759240 (https://phabricator.wikimedia.org/T296631) (owner: 10Matthias Mullie) [12:07:22] (03PS3) 10Aklapper: Fix broken rendering of characters in EasyTimeline for Yue Chinese [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) [12:08:12] (03Abandoned) 10Aklapper: Fix broken rendering of characters in EasyTimeline for Yue Chinese [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) (owner: 10Aklapper) [12:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300402)', diff saved to https://phabricator.wikimedia.org/P20062 and previous config saved to /var/cache/conftool/dbconfig/20220203-120825-marostegui.json [12:08:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:08:28] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:30] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T300402)', diff saved to https://phabricator.wikimedia.org/P20063 and previous config saved to /var/cache/conftool/dbconfig/20220203-120832-marostegui.json [12:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:53] !log mlitn@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:759240|[WikibaseMediaInfo] Stop normalizing full text scores (T296631)]] (duration: 00m 52s) [12:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:58] T296631: Reconsider normalizeFulltextScores implementation - https://phabricator.wikimedia.org/T296631 [12:10:13] !log eqord: push Capirca generated border-in filters [12:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:24] apergos: I'm done [12:10:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20064 and previous config saved to /var/cache/conftool/dbconfig/20220203-121024-marostegui.json [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:37] awesome [12:11:09] nn1l2: are you ready? oh hey who wants to do this deploy, Lucas_WMDE, taavi, me? [12:11:15] sure, I can do it [12:11:19] ok then go ahead [12:11:24] yes [12:11:33] your patch is up, ready? 
[12:11:59] (03PS2) 10Majavah: commonswiki: Add www.gbols.smns-bw.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759469 (https://phabricator.wikimedia.org/T300842) (owner: 104nn1l2) [12:12:04] after the merge, it will go out to mwdebug100... um... 2? (forget) first, you should be ready to test it there [12:12:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300402)', diff saved to https://phabricator.wikimedia.org/P20065 and previous config saved to /var/cache/conftool/dbconfig/20220203-121216-marostegui.json [12:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:31] to make sure the change is actually reflected. [12:12:31] (03CR) 10Majavah: [C: 03+2] commonswiki: Add www.gbols.smns-bw.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759469 (https://phabricator.wikimedia.org/T300842) (owner: 104nn1l2) [12:12:39] apergos: nn1l2 is familiar with the process [12:12:49] ok, great! [12:12:56] * apergos chills [12:13:39] (03Merged) 10jenkins-bot: commonswiki: Add www.gbols.smns-bw.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759469 (https://phabricator.wikimedia.org/T300842) (owner: 104nn1l2) [12:14:04] nn1l2: your patch is available for testing on mwdebug1001 [12:15:02] LGTM: https://commons.wikimedia.org/wiki/File:Xysticus_luctuosus_SMNS-1065_01.jpg [12:15:07] great, syncing [12:15:08] \o/ [12:15:17] Is B&C near its end? 
[12:15:29] if you have more patches I can look at them [12:15:31] I'm working on another patch, I need 5 min [12:15:39] 45 more minutes to the window [12:15:44] sure, just ping me when you have something [12:15:55] but make sure you add the patch info to the calendar so we have the record of it there too [12:16:00] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:759469|commonswiki: Add www.gbols.smns-bw.org to the wgCopyUploadsDomains allowlist (T300842)]] (duration: 00m 50s) [12:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:04] T300842: Add https://www.gbols.smns-bw.org/ to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300842 [12:16:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:16:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:49] (03PS2) 104nn1l2: mniwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759479 (https://phabricator.wikimedia.org/T294709) [12:16:51] also in case you have now heard this spiel before, we encourage every developer who needs patches deployed to get training and learn to deploy, you can come to multiple sessions if you want until you are comfortable. 
[12:16:56] *have not heard [12:17:32] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10jcrespo) [12:18:01] (03PS10) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [12:19:37] !log codfw: push Capirca generated border-in filters [12:19:38] I have another patch [12:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:50] Do you still deploy? [12:20:00] yes [12:20:15] https://wikitech.wikimedia.org/wiki/Deployments#%7B%7BDeployment_day%7Cdate%3D2022-02-03%7D%7D [12:23:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:23:19] I'll admit I have no clue how to review a wordmark change other than "it sort of looks like what's being requested in the task" [12:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:35] :-D same [12:23:51] I have a sort of "looks reasonable" criterion [12:23:53] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:23:58] so we should wait for someone from the design team? [12:24:16] someone ought to be able to +1 it where that +1 has meaning, yeah [12:24:25] whoever that is, you would know best [12:24:31] I'm pretty well-versed in SVG [12:24:32] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10jcrespo) {icon check} Public key provided seems fine, and different from that on LDAP. {icon check} Manager approval verified through HR tools and check_use... 
[12:24:57] I think we can go ahead [12:25:03] https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests (my usual go-to for any usual config changes) doesn't mention it either [12:25:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298558)', diff saved to https://phabricator.wikimedia.org/P20066 and previous config saved to /var/cache/conftool/dbconfig/20220203-122529-marostegui.json [12:25:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:25:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:33] apergos: I'm willing to sync it if you don't object, we can always revert if there are problems [12:25:34] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [12:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:36] Look at this one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704376/ [12:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [12:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [12:25:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [12:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:25:51] I followed that exactly [12:25:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [12:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [12:26:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [12:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:11] shrug [12:26:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298558)', diff saved to https://phabricator.wikimedia.org/P20067 and previous config saved to /var/cache/conftool/dbconfig/20220203-122612-marostegui.json [12:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:21] go ahead [12:26:27] I'm good at SVG, commons admin, see my Commons userpage: https://commons.wikimedia.org/wiki/User:4nn1l2 [12:26:38] taavi: additions to https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests are always welcome :) (that page is kind of an experiment to allow tech-curious folks to try stuff themselves) [12:26:56] (03PS3) 10Majavah: mniwiktionary: Add localized mobile wordmark 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/759479 (https://phabricator.wikimedia.org/T294709) (owner: 104nn1l2) [12:26:56] it's not that so much as there's a rule about having at least someone else with a clue cr things. but anyways [12:27:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P20068 and previous config saved to /var/cache/conftool/dbconfig/20220203-122720-marostegui.json [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:23] not gonna block it. [12:28:02] (03PS1) 10Muehlenhoff: Set equal weights for mx2001 [dns] - 10https://gerrit.wikimedia.org/r/759496 [12:28:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:56] (03CR) 10Majavah: [C: 03+2] mniwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759479 (https://phabricator.wikimedia.org/T294709) (owner: 104nn1l2) [12:29:37] (03Merged) 10jenkins-bot: mniwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759479 (https://phabricator.wikimedia.org/T294709) (owner: 104nn1l2) [12:29:56] !log eqsin: push Capirca generated border-in filters [12:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:24] nn1l2: can you test on mwdebug1001 please? [12:30:28] ok [12:31:05] It looks perfect [12:31:10] \o/ [12:31:11] I can send a screenshot [12:31:15] if need be [12:31:27] syncing [12:31:27] here? nah. 
but I am sure the people on the task will love it :-) [12:31:39] Thanks [12:32:33] !log taavi@deploy1002 Synchronized static/images/mobile/copyright/wiktionary-wordmark-mni.svg: Config: [[gerrit:759479|mniwiktionary: Add localized mobile wordmark (T294709)]] (1/2) (duration: 00m 50s) [12:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:39] T294709: Add Localized mobile wordmark logo in Meetei Wiktionary. - https://phabricator.wikimedia.org/T294709 [12:32:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:32:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:29] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:33:34] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:759479|mniwiktionary: Add localized mobile wordmark (T294709)]] (2/2) (duration: 00m 49s) [12:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:57] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:35:00] 10SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (10jcrespo) Hi, I am trying to understand which team can help with this: is this an #observability issue? A #continuous-integration-infrastructure issue? A production #kubernetes image issue (so #serviceops) ? As... [12:37:11] does anyone have anything else to deploy? [12:38:12] crickets... [12:38:17] looks like that's it for the window. [12:38:23] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/759498 (https://phabricator.wikimedia.org/T296334) [12:38:27] !log UTC morning backport window done [12:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:37] see everyone next time! 
[12:38:48] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/759498 (https://phabricator.wikimedia.org/T296334) (owner: 10Kosta Harlan) [12:39:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P20069 and previous config saved to /var/cache/conftool/dbconfig/20220203-124225-marostegui.json [12:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:59] 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (10jcrespo) Hey, Andy- moving the pending tag to Radar (the reason is that an #SRE -only tag in backlog will normally get monitored and pinged by the SRE... 
[12:43:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:43:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:43:14] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/759498 (https://phabricator.wikimedia.org/T296334) (owner: 10Kosta Harlan) [12:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:55] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [12:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:58] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [12:43:59] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [12:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:00] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on staging [12:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:39] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [12:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:42] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [12:44:43] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [12:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:23] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:46:49] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:47:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:57] (03CR) 10Jbond: [C: 04-1] "-1: see comments inline this should have already been included in the package" [puppet] - 10https://gerrit.wikimedia.org/r/758548 (owner: 10JHathaway) [12:48:32] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [12:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:49] 10SRE: Find a way to verify mediawiki-config IPs ahead of datacenter switchovers - https://phabricator.wikimedia.org/T163354 (10jcrespo) p:05High→03Low 5 years without updates- setting the priority to reflect reality rather than the original idea. [12:49:06] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[12:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:48] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on external [12:49:48] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on internal [12:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:50] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply on staging [12:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:10] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on external [12:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:47] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on internal [12:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:25] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on external [12:53:25] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on internal [12:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:29] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply on staging [12:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300402)', diff saved to https://phabricator.wikimedia.org/P20071 and previous config saved to /var/cache/conftool/dbconfig/20220203-125730-marostegui.json [12:57:31] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [12:57:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [12:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:34] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T300402)', diff saved to https://phabricator.wikimedia.org/P20072 and previous config saved to /var/cache/conftool/dbconfig/20220203-125737-marostegui.json [12:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:50] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on external [12:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:35] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on internal [12:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:25] (03PS11) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [13:00:31] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33569/console" [puppet] - 10https://gerrit.wikimedia.org/r/759470 (https://phabricator.wikimedia.org/T273866) (owner: 10JMeybohm) [13:01:48] 10SRE, 10Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10jcrespo) @Dzahn @RLazarus Maybe this task can be closed now? Any pending subtasks? 
[13:02:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300402)', diff saved to https://phabricator.wikimedia.org/P20073 and previous config saved to /var/cache/conftool/dbconfig/20220203-130224-marostegui.json [13:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:31] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Create a new wikimedia_cluster: kubernetes-staging [puppet] - 10https://gerrit.wikimedia.org/r/759470 (https://phabricator.wikimedia.org/T273866) (owner: 10JMeybohm) [13:03:25] (03CR) 10Jbond: "LGTM but wonder if DC ops needs to find a replacement?" [puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) (owner: 10Arturo Borrero Gonzalez) [13:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298558)', diff saved to https://phabricator.wikimedia.org/P20074 and previous config saved to /var/cache/conftool/dbconfig/20220203-130430-marostegui.json [13:04:33] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:35] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [13:04:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/759496 (owner: 10Muehlenhoff) [13:05:05] (03CR) 10Muehlenhoff: ferm: replace systemd unit to ensure success on boot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758548 (owner: 10JHathaway) [13:08:23] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:11:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:15:02] 10SRE, 10Observability-Logging: Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10fgiunchedi) [13:15:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [13:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:35] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:16:22] (03PS17) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [13:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P20075 and previous config saved to /var/cache/conftool/dbconfig/20220203-131729-marostegui.json [13:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:58] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [13:19:35] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20076 and previous config saved to /var/cache/conftool/dbconfig/20220203-131935-marostegui.json [13:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:19] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:21:45] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1020 [13:24:49] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:24:50] (03PS1) 10Cathal Mooney: Add inbound and outbound BGP filters on CR to SPINE eBGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/759500 (https://phabricator.wikimedia.org/T299758) [13:27:04] !log esams: push Capirca generated border-in filters [13:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:16] !log moved kubernetes staging master,nodes,etcd from wikimedia_cluster "kubernetes" to "kubernetes-staging" - T273866 [13:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:19] T273866: Kubernetes prod and staging share the same cluster key in hiera - https://phabricator.wikimedia.org/T273866 [13:27:21] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:28:00] !log installing apache security updates [13:28:02] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:30] !log Test T300858 [13:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:33] T300858: Testing task - https://phabricator.wikimedia.org/T300858 [13:31:26] (03PS2) 10Cathal Mooney: Add inbound and outbound BGP filters on CR to SPINE eBGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/759500 (https://phabricator.wikimedia.org/T299758) [13:32:17] (03CR) 10Ayounsi: [C: 03+1] "As discussed over IRC, change LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/759500 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:32:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P20077 and previous config saved to /var/cache/conftool/dbconfig/20220203-133234-marostegui.json [13:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:33:39] (03PS5) 10JMeybohm: Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) [13:33:41] (03PS5) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [13:34:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20078 and previous config saved to /var/cache/conftool/dbconfig/20220203-133439-marostegui.json [13:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:12] (03CR) 10Cathal Mooney: [C: 03+2] Add inbound and outbound BGP filters on CR to SPINE eBGP sessions 
[homer/public] - 10https://gerrit.wikimedia.org/r/759500 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:35:17] (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-staging LVS VIPs [dns] - 10https://gerrit.wikimedia.org/r/759253 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [13:35:26] !log disable puppet fleet wide for puppetdb restart [13:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:33] (03CR) 10Ladsgroup: [C: 03+1] filtered_tables.txt: Add tl_target_id [puppet] - 10https://gerrit.wikimedia.org/r/759380 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [13:36:22] (03Merged) 10jenkins-bot: Add inbound and outbound BGP filters on CR to SPINE eBGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/759500 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:36:51] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:37:03] (03CR) 10Jbond: [V: 03+1] "FYI i have asked moritz to take a look at the the debian packaging part[1], once that is good ill build a package and deploy this change" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:37:53] (03CR) 10Ladsgroup: [C: 03+1] "Generally looks good, but either wait for T300702 to be implemented or add the stop replication magic there." [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759379 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [13:38:51] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:39:22] fat fingers meant i didn't disable puppetdb so we will get an alert soon sorry [13:40:15] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:40:30] jbond: too much finger-food does that ;) [13:40:45] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [13:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:48] :D [13:41:00] * jbond mmm finger food [13:41:37] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.03017 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:42:12] (03Abandoned) 10Phuedx: Update .mailmap to de-duplicate my email addresses [puppet] - 10https://gerrit.wikimedia.org/r/648239 (owner: 10Phuedx) [13:43:32] ^^^ this should clear in the next 5 mins [13:44:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:24] (03CR) 10Marostegui: [V: 03+2 C: 03+2] add_tl_target_id_T300775.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759379 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [13:45:35] (03CR) 10Marostegui: [C: 03+2] filtered_tables.txt: Add tl_target_id [puppet] - 10https://gerrit.wikimedia.org/r/759380 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [13:47:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300402)', diff saved to https://phabricator.wikimedia.org/P20079 and previous config saved to /var/cache/conftool/dbconfig/20220203-134739-marostegui.json [13:47:40] !log
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [13:47:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [13:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:43] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [13:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:46] (03CR) 10Volans: "All comments were addressed, I'll check the tests now. The code part LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:47:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T300402)', diff saved to https://phabricator.wikimedia.org/P20080 and previous config saved to /var/cache/conftool/dbconfig/20220203-134746-marostegui.json [13:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:37] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005587 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298558)', diff saved to https://phabricator.wikimedia.org/P20081 and previous config saved to /var/cache/conftool/dbconfig/20220203-134944-marostegui.json [13:49:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [13:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1119.eqiad.wmnet with reason: Maintenance [13:49:49] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [13:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298558)', diff saved to https://phabricator.wikimedia.org/P20082 and previous config saved to /var/cache/conftool/dbconfig/20220203-134952-marostegui.json [13:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300402)', diff saved to https://phabricator.wikimedia.org/P20083 and previous config saved to /var/cache/conftool/dbconfig/20220203-135029-marostegui.json [13:50:31] (03CR) 10Jbond: "god i dont know what broke ci checking now" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:09] !log eqiad: push Capirca generated border-in filters [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:55:58] (03PS6) 10Ayounsi: Move core routers border-in filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) [13:56:00] (03PS4) 10Ayounsi: Delete now unused analytics policy file [homer/public] - 10https://gerrit.wikimedia.org/r/758470 [13:58:33] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:58:44] (03PS18) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [14:01:53] (03CR) 10Ayounsi: [C: 03+2] Move core routers border-in filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [14:02:31] (03Merged) 10jenkins-bot: Move core routers border-in filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [14:02:33] (03Merged) 10jenkins-bot: Delete now unused analytics policy file [homer/public] - 10https://gerrit.wikimedia.org/r/758470 (owner: 10Ayounsi) [14:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298558)', diff saved to https://phabricator.wikimedia.org/P20084 and previous config saved to /var/cache/conftool/dbconfig/20220203-140503-marostegui.json [14:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [14:05:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P20085 and previous config saved to /var/cache/conftool/dbconfig/20220203-140534-marostegui.json [14:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:23] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) >>! 
In T293209#7670485, @Volans wrote: > I had a chat with @jbond about this yesterday, putting the summary here for future reference for tho... [14:13:19] (03CR) 10Elukey: [C: 03+1] "let's merge!" [puppet] - 10https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [14:13:46] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the system default mysql prometheus exporter for analytics-meta and matomo [puppet] - 10https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) (owner: 10Btullis) [14:14:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:16:11] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:18:41] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:20:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20086 and previous config saved to /var/cache/conftool/dbconfig/20220203-142007-marostegui.json [14:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P20087 and previous config saved to /var/cache/conftool/dbconfig/20220203-142039-marostegui.json [14:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:39] PROBLEM - etcd request latencies on kubestagemaster2001 
is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:23:17] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/i18n/pcs (Get i18n strings for the Page Content Service) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys [14:23:17] move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:27:17] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:30:15] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:32:48] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) >>! In T293209#7675048, @fgiunchedi wrote: >>>! In T293209#7670485, @Volans wrote: >> - To support also the downtime of specific services (in I... 
[14:35:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20088 and previous config saved to /var/cache/conftool/dbconfig/20220203-143512-marostegui.json [14:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300402)', diff saved to https://phabricator.wikimedia.org/P20089 and previous config saved to /var/cache/conftool/dbconfig/20220203-143544-marostegui.json [14:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:48] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [14:36:37] (03PS6) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [14:36:39] (03PS5) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [14:38:19] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [14:40:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:40:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [14:40:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [14:40:09] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [14:40:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [14:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T300402)', diff saved to https://phabricator.wikimedia.org/P20090 and previous config saved to /var/cache/conftool/dbconfig/20220203-144017-marostegui.json [14:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T300402)', diff saved to https://phabricator.wikimedia.org/P20091 and previous config saved to /var/cache/conftool/dbconfig/20220203-144224-marostegui.json [14:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:29] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [14:43:22] (03PS1) 10Cathal Mooney: Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. [homer/public] - 10https://gerrit.wikimedia.org/r/759503 (https://phabricator.wikimedia.org/T299758) [14:44:29] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:57] (03PS6) 10Ayounsi: Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) [14:46:08] (03PS2) 10Cathal Mooney: Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. [homer/public] - 10https://gerrit.wikimedia.org/r/759503 (https://phabricator.wikimedia.org/T299758) [14:48:53] (03CR) 10Ayounsi: [C: 03+1] "LGTM! This will most likely change by the time we have a similar setup in other sites :)" [homer/public] - 10https://gerrit.wikimedia.org/r/759503 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:49:13] (03PS1) 10Kormat: wmfdb/cli_admin/db_compare: Add db-compare utility. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) [14:49:15] (03CR) 10Cathal Mooney: [C: 03+2] Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. [homer/public] - 10https://gerrit.wikimedia.org/r/759503 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:49:50] (03Merged) 10jenkins-bot: Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. 
[homer/public] - 10https://gerrit.wikimedia.org/r/759503 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298558)', diff saved to https://phabricator.wikimedia.org/P20092 and previous config saved to /var/cache/conftool/dbconfig/20220203-145017-marostegui.json [14:50:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [14:50:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [14:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:22] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [14:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298558)', diff saved to https://phabricator.wikimedia.org/P20093 and previous config saved to /var/cache/conftool/dbconfig/20220203-145024-marostegui.json [14:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298558)', diff saved to https://phabricator.wikimedia.org/P20094 and previous config saved to /var/cache/conftool/dbconfig/20220203-145132-marostegui.json [14:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:11] (03PS7) 10Ayounsi: Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) [14:54:39] (03PS1) 10Cathal Mooney: Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. 
[homer/public] - 10https://gerrit.wikimedia.org/r/759505 [14:55:12] (03CR) 10Cathal Mooney: [C: 03+2] Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. [homer/public] - 10https://gerrit.wikimedia.org/r/759505 (owner: 10Cathal Mooney) [14:55:38] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. [homer/public] - 10https://gerrit.wikimedia.org/r/759505 (owner: 10Cathal Mooney) [14:55:47] (03Merged) 10jenkins-bot: Adjust CR templates so BGP_Switch_In doesn't reference K8s policy. [homer/public] - 10https://gerrit.wikimedia.org/r/759505 (owner: 10Cathal Mooney) [14:56:32] (03CR) 10Ayounsi: Port labs-in4/6 to Capirca (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [14:57:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P20095 and previous config saved to /var/cache/conftool/dbconfig/20220203-145729-marostegui.json [14:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:14] (03PS12) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [15:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20096 and previous config saved to /var/cache/conftool/dbconfig/20220203-150636-marostegui.json [15:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:43] (03CR) 10Hashar: "Summary of the changes I have made today (between patchset 7 and 12)." 
[puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [15:11:43] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10AndyRussG) Fantastic, thanks so so much @jcrespo! Pls don't hesitate to reach out if there's anything at all that I can help with! :) :) [15:12:25] !log installing apache security updates on gerrit1001 [15:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P20097 and previous config saved to /var/cache/conftool/dbconfig/20220203-151234-marostegui.json [15:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:20] (03PS2) 10Muehlenhoff: Set equal weights for mx2001 [dns] - 10https://gerrit.wikimedia.org/r/759496 [15:14:20] is lists.wikimedia.org also falling over or something (getting very long response times) [15:17:28] (03CR) 10Elukey: [C: 03+1] Add kubernetes-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:17:30] (03PS7) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [15:17:32] (03PS6) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [15:19:06] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [15:19:44] (03CR) 10Muehlenhoff: "Or on a more general level;" [puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) (owner: 10Arturo Borrero Gonzalez) [15:21:05] (03CR) 10Muehlenhoff: base: standard_packages: don't 
install hp-health on Debian Bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) (owner: 10Arturo Borrero Gonzalez) [15:21:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20098 and previous config saved to /var/cache/conftool/dbconfig/20220203-152141-marostegui.json [15:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:35] (03CR) 10Muehlenhoff: [C: 03+1] "But let's merge this as it is to unblock the Bullseye/WMCS work and we can still figure out the followup later." [puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) (owner: 10Arturo Borrero Gonzalez) [15:25:00] 10SRE-Access-Requests: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10cchen) [15:27:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T300402)', diff saved to https://phabricator.wikimedia.org/P20099 and previous config saved to /var/cache/conftool/dbconfig/20220203-152739-marostegui.json [15:27:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:27:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:44] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [15:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T300402)', diff saved to https://phabricator.wikimedia.org/P20100 and previous config saved to /var/cache/conftool/dbconfig/20220203-152746-marostegui.json [15:27:48] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300402)', diff saved to https://phabricator.wikimedia.org/P20101 and previous config saved to /var/cache/conftool/dbconfig/20220203-152953-marostegui.json [15:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:39] (03PS3) 10Dzahn: icinga: Add Papaul to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759364 (https://phabricator.wikimedia.org/T300660) (owner: 10Papaul) [15:31:53] (03CR) 10Elukey: [C: 03+2] helmfile.d: add circuit breaking settings for ml-serve's egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/757675 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [15:31:58] (03CR) 10Dzahn: [C: 03+1] icinga: Add Papaul to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759364 (https://phabricator.wikimedia.org/T300660) (owner: 10Papaul) [15:34:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[15:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298558)', diff saved to https://phabricator.wikimedia.org/P20102 and previous config saved to /var/cache/conftool/dbconfig/20220203-153646-marostegui.json [15:36:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [15:36:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [15:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:51] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [15:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298558)', diff saved to https://phabricator.wikimedia.org/P20103 and previous config saved to /var/cache/conftool/dbconfig/20220203-153653-marostegui.json [15:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:25] (03CR) 10Phuedx: Update Event Stream for IPInfo events (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [15:37:30] (03CR) 10Phuedx: [C: 04-1] Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [15:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298558)', diff saved to https://phabricator.wikimedia.org/P20104 and previous config saved to 
/var/cache/conftool/dbconfig/20220203-153801-marostegui.json [15:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:29] 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (10AndyRussG) Got it! thanks so much again, @jcrespo :) [15:44:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P20105 and previous config saved to /var/cache/conftool/dbconfig/20220203-154458-marostegui.json [15:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:12] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (10ayounsi) [15:50:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:51:26] 10Puppet, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10User-brennen: logspam-watch: sorting by message (column 6) appears broken - https://phabricator.wikimedia.org/T300298 (10dancy) 05Open→03Resolved a:03dancy [15:52:45] (03CR) 10Papaul: [C: 03+2] icinga: Add Papaul to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759364 (https://phabricator.wikimedia.org/T300660) (owner: 10Papaul) [15:53:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20106 and previous config saved to /var/cache/conftool/dbconfig/20220203-155306-marostegui.json [15:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:07] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2011.codfw.wmnet [15:55:10] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:52] !log hnowlan@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts restbase2011.codfw.wmnet [15:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:56:26] (03PS3) 10Jbond: gitlab_runner: execute gitlab-runner as non-root [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:56:37] (03PS1) 10Filippo Giunchedi: prometheus: relabel 'instance' in job=prometheus with hostname [puppet] - 10https://gerrit.wikimedia.org/r/759517 [16:00:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P20107 and previous config saved to /var/cache/conftool/dbconfig/20220203-160003-marostegui.json [16:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2011.codfw.wmnet [16:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:28] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) [16:05:30] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) [16:06:06] (03CR) 10Jbond: [C: 04-1] "see inline for comments, -1 is just for the user gid" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:08:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20108 and previous config saved to /var/cache/conftool/dbconfig/20220203-160811-marostegui.json [16:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:12] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2030.mgmt.codfw.wmnet with reboot policy FORCED [16:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:16] (03PS2) 10Jbond: C:mw_rc_irc::ircserver: Refresh ircd services on config changes [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) [16:10:24] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2030.mgmt.codfw.wmnet with reboot policy FORCED [16:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:52] (03CR) 10jerkins-bot: [V: 04-1] C:mw_rc_irc::ircserver: Refresh ircd services on config changes [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [16:13:41] (03CR) 10Herron: [C: 03+1] prometheus: relabel 'instance' in job=prometheus with hostname [puppet] - 10https://gerrit.wikimedia.org/r/759517 (owner: 10Filippo Giunchedi) [16:14:15] (03CR) 10Jbond: C:mw_rc_irc::ircserver: Refresh ircd services on config changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [16:14:40] (03CR) 10Muehlenhoff: [C: 03+2] Set equal weights for mx2001 [dns] - 10https://gerrit.wikimedia.org/r/759496 (owner: 10Muehlenhoff) [16:15:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300402)', diff saved to https://phabricator.wikimedia.org/P20109 and previous config saved to /var/cache/conftool/dbconfig/20220203-161508-marostegui.json [16:15:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1099.eqiad.wmnet with reason: Maintenance [16:15:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [16:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:13] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [16:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T300402)', diff saved to https://phabricator.wikimedia.org/P20110 and previous config saved to /var/cache/conftool/dbconfig/20220203-161515-marostegui.json [16:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:44] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edi [16:15:44] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:15:44] (03PS1) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php (revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 [16:16:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300402)', diff saved to https://phabricator.wikimedia.org/P20111 and previous config saved to /var/cache/conftool/dbconfig/20220203-161622-marostegui.json [16:16:23] (03Abandoned) 10Ahmon Dancy: MWMultiVersion.php: Flexible wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 (owner: 
10Ahmon Dancy) [16:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:28] (03PS1) 10Arturo Borrero Gonzalez: mcrouter: introduce updates for bullseye build [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 [16:19:49] 10SRE, 10Observability-Logging: Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10colewhite) p:05Triage→03High [16:23:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298558)', diff saved to https://phabricator.wikimedia.org/P20113 and previous config saved to /var/cache/conftool/dbconfig/20220203-162316-marostegui.json [16:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:20] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [16:26:53] 10SRE, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Andrew) via IRC chat: We'll move cloudmetrics1004 to the same kernel as 1003 and call this done. 
[16:27:39] (03PS2) 10Aklapper: mediawiki: Better error page layout on mobile devices [puppet] - 10https://gerrit.wikimedia.org/r/405058 (https://phabricator.wikimedia.org/T182247) (owner: 10Phantom42) [16:27:50] (03PS4) 10Aklapper: Set $wgUploadNavigationUrl for few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364121 (https://phabricator.wikimedia.org/T170083) (owner: 10Framawiki) [16:31:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P20114 and previous config saved to /var/cache/conftool/dbconfig/20220203-163127-marostegui.json [16:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:19] (03PS3) 10Aklapper: Update cron with lb and lb-pool params [puppet] - 10https://gerrit.wikimedia.org/r/553097 (https://phabricator.wikimedia.org/T238751) (owner: 10Alaa Sarhan) [16:33:51] 10SRE, 10Traffic-Icebox, 10serviceops, 10Patch-Needs-Improvement: Investigate the remaining usage of X-Real-IP - https://phabricator.wikimedia.org/T239340 (10Aklapper) [16:35:28] (03PS13) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [16:39:47] (03CR) 10Krinkle: "Doesn't seem to look quite right when I tried it in Firefox: https://phabricator.wikimedia.org/F34941517" [puppet] - 10https://gerrit.wikimedia.org/r/405058 (https://phabricator.wikimedia.org/T182247) (owner: 10Phantom42) [16:39:59] (03PS1) 10Alexandros Kosiaris: Remove upstart/sysvinit file [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759525 [16:40:01] (03PS1) 10Alexandros Kosiaris: Bump requirements to match 1.8.14 upstream [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759526 [16:40:03] (03PS1) 10Alexandros Kosiaris: Refresh local patches, drop X-Client-IP logging [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759527 [16:40:05] (03PS1) 10Alexandros Kosiaris: Add 
execute permission to npm-cli.js [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759528 [16:40:07] (03PS1) 10Alexandros Kosiaris: Bump changelog to 1.8.14 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759529 [16:40:09] (03PS1) 10Alexandros Kosiaris: Bump to 1.8.16 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/759530 [16:45:13] 10SRE, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Andrew) 05Open→03Resolved 1003 and 1004 are both running 5.10 and seem happy. [16:45:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10Andrew) [16:45:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10Andrew) [16:46:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P20115 and previous config saved to /var/cache/conftool/dbconfig/20220203-164632-marostegui.json [16:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:20] 10SRE, 10Icinga, 10User-Ladsgroup: Request downtime hosts and services privileges in Icinga - https://phabricator.wikimedia.org/T300660 (10Papaul) 05Open→03Resolved This is complete tested it on ml-serve2008. 
thanks @Dzahn @Ladsgroup [16:51:44] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) @MoritzMuehlenhoff 1016 is fixed and accessible, 1011 and 1020 updating now [16:53:51] 10SRE, 10SRE Observability (FY2021/2022-Q3): Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10lmata) [16:56:37] (03PS2) 10Arturo Borrero Gonzalez: mcrouter: introduce updates for bullseye build [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 [16:57:36] (03CR) 10Hashar: "I had to add `grub-install /dev/sda` in the list of customizations which got Grub updated and the images boot fine now!" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [17:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:01:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300402)', diff saved to https://phabricator.wikimedia.org/P20116 and previous config saved to /var/cache/conftool/dbconfig/20220203-170136-marostegui.json [17:01:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [17:01:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [17:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:43] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [17:01:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T300402)', diff saved to https://phabricator.wikimedia.org/P20117 and previous config saved to /var/cache/conftool/dbconfig/20220203-170144-marostegui.json [17:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300402)', diff saved to https://phabricator.wikimedia.org/P20118 and previous config saved to /var/cache/conftool/dbconfig/20220203-170351-marostegui.json [17:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:00] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [17:06:16] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) 1011 and 1020 have been updated [17:09:17] PROBLEM - Mobileapps LVS 
codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:10:35] (03PS3) 10Arturo Borrero Gonzalez: mcrouter: introduce updates for bullseye build [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 [17:11:47] (03PS1) 10Michael DiPietro: mcrouter for bullseye. This does some unexpected things, in particular build.sh doesn't seem to work, errors out with a seg fault. README updated with instructions on how to make it work. [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759539 (https://phabricator.wikimedia.org/T300578) [17:12:45] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts restbase2011.codfw.wmnet [17:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:11] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) a:05JMeybohm→03None [17:13:45] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10JMeybohm) [17:13:47] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2011.codfw.wmnet [17:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:50] 10ops-eqiad, 10DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T300820 (10wiki_willy) a:03Cmjohnson [17:18:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P20120 and previous config saved to 
/var/cache/conftool/dbconfig/20220203-171856-marostegui.json [17:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:44] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts restbase2011.codfw.wmnet [17:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:31] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (10akosiaris) @Dzahn I 've pushed to gerrit the new git-buildpackage upstream changes (bumped first to 1.8.14 and 1.8.16). I 've bypassed code re... [17:25:38] (03PS1) 10Dduvall: beta: Discover etcd servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) [17:28:33] (03PS2) 10Michael DiPietro: mcrouter for bullseye. This does some unexpected things, in particular build.sh doesn't seem to work, errors out with a seg fault. README updated with instructions on how to make it work. [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759539 (https://phabricator.wikimedia.org/T300578) [17:34:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P20122 and previous config saved to /var/cache/conftool/dbconfig/20220203-173400-marostegui.json [17:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:22] (03PS1) 10Jdlrobson: Drop skin override [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) [17:36:30] (03PS1) 10Jdlrobson: Update skin checks with new vector skin key. 
[extensions/ContentTranslation] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759314 (https://phabricator.wikimedia.org/T298916) [17:36:38] (03CR) 10jerkins-bot: [V: 04-1] Drop skin override [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) (owner: 10Jdlrobson) [17:36:53] (03PS1) 10Btullis: Remove temporary additional port for AQS servers. [homer/public] - 10https://gerrit.wikimedia.org/r/759546 (https://phabricator.wikimedia.org/T291472) [17:37:17] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10JMeybohm) > I'm unsure if we should use one of the many kubernetes python libraries available: > > * python-kubernetes, the official library, is quite hard to package... [17:39:01] (03CR) 10Ayounsi: [C: 03+1] Remove temporary additional port for AQS servers. [homer/public] - 10https://gerrit.wikimedia.org/r/759546 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [17:40:33] (03PS4) 10Arturo Borrero Gonzalez: mcrouter: introduce updates for bullseye build [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) [17:41:19] (03CR) 10Ayounsi: [C: 03+2] Remove temporary additional port for AQS servers. [homer/public] - 10https://gerrit.wikimedia.org/r/759546 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [17:41:52] (03Merged) 10jenkins-bot: Remove temporary additional port for AQS servers. [homer/public] - 10https://gerrit.wikimedia.org/r/759546 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [17:43:52] (03CR) 10Daniel Kinzler: [C: 03+1] "We want this, but I can't tell if this will in fact discover the correct host." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [17:46:51] (03PS1) 10Dzahn: planet: add diff.wikimedia.org/feed to en.planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/759547 [17:48:42] (03PS2) 10Dzahn: planet: add diff.wikimedia.org/feed to en.planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/759547 [17:49:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300402)', diff saved to https://phabricator.wikimedia.org/P20125 and previous config saved to /var/cache/conftool/dbconfig/20220203-174905-marostegui.json [17:49:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [17:49:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [17:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:10] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [17:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T300402)', diff saved to https://phabricator.wikimedia.org/P20126 and previous config saved to /var/cache/conftool/dbconfig/20220203-174913-marostegui.json [17:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:47] (03PS1) 10Majavah: planet: add my blog (taavi.wtf) [puppet] - 10https://gerrit.wikimedia.org/r/759548 [17:50:33] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English 
Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:51:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300402)', diff saved to https://phabricator.wikimedia.org/P20127 and previous config saved to /var/cache/conftool/dbconfig/20220203-175120-marostegui.json [17:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:45] (03CR) 10Dzahn: "If i knew people who are in "movement comms" and also have a Gerrit user, I would add them here. feel free to do that if you do" [puppet] - 10https://gerrit.wikimedia.org/r/759547 (owner: 10Dzahn) [17:55:07] (03PS1) 10Ayounsi: Fix network upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/759549 [17:56:50] (03CR) 10Dzahn: [C: 03+2] "nice, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/759548 (owner: 10Majavah) [17:57:19] (03PS5) 10Arturo Borrero Gonzalez: mcrouter: introduce updates for Debian Bullseye [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) [17:57:21] (03PS1) 10Arturo Borrero Gonzalez: mcrouter: add .gitreview file [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759551 [17:57:23] (03PS1) 10Arturo Borrero Gonzalez: docker_entry.sh: override debian mirror [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759552 [17:57:25] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 2022.01.31.00 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 [17:57:27] (03PS1) 10Arturo Borrero Gonzalez: gitignore: ignore additional debian artifacts [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759554 [17:58:14] beta cluster is down? 
[17:58:18] (03PS2) 10Arturo Borrero Gonzalez: mcrouter: add .gitreview file [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759551 [17:58:20] (03PS2) 10Arturo Borrero Gonzalez: docker_entry.sh: override debian mirror [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759552 [17:58:22] (03PS6) 10Arturo Borrero Gonzalez: mcrouter: introduce updates for Debian Bullseye [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) [17:58:24] (03PS2) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 2022.01.31.00 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 [17:58:26] (03PS2) 10Arturo Borrero Gonzalez: gitignore: ignore additional debian artifacts [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759554 [17:58:46] edsanders: looks so, yeah [17:59:03] this seems like a good reminder that there hasn't been anyone responsible for maintaining it for multiple years [18:00:05] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T1800). [18:00:08] (03CR) 10Majavah: [C: 04-1] d/changelog: generate entry for 2022.01.31.00 (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 (owner: 10Arturo Borrero Gonzalez) [18:00:21] (03CR) 10Dduvall: beta: Discover etcd servers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [18:00:35] deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud is up.
There are about 8 php-fpm processes sucking up CPU time [18:01:23] 10SRE, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10jcrespo) a:03jcrespo [18:01:52] 10SRE, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10jcrespo) [18:01:56] mediawiki12? I thought that was only a temporary box for testing PHP 7.4 puppetization, when has someone moved traffic to it? [18:02:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/758584 [18:02:44] due to T300591 [18:02:45] T300591: Beta cluster MediaWiki code not updating - https://phabricator.wikimedia.org/T300591 [18:03:11] 10SRE, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10jcrespo) p:05Triage→03High Everything seems to be fine, only 2 steps left are approvals from @AUgolnikova-WMF's manager and Analytics ok (updated on the ticket). [18:03:20] umh [18:03:39] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10jcrespo) [18:03:46] mediawiki11 is still receiving traffic for w-beta.wmflabs.org and from most services (https://codesearch.wmcloud.org/search/?q=mediawiki11&i=nope&files=&excludeFiles=&repos=) [18:04:10] and now it's not receiving code updates? 
[18:04:33] well that's confusing [18:05:00] * taavi lets someone who depends on beta working deal with this mess [18:05:32] (03CR) 10Cwhite: initial sketch of watchrat alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [18:06:04] I'm poking around a bit [18:06:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P20128 and previous config saved to /var/cache/conftool/dbconfig/20220203-180624-marostegui.json [18:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:34] https://phabricator.wikimedia.org/rCLIP35f8c97bd92ad240f2d4a52f6d37d916c911f977 also looks related [18:07:37] I can confirm that `deployment-mediawiki12` (but not mediawiki11) is in `/etc/dsh/group/mediawiki-installation` [18:08:35] (03PS1) 10DLynch: New bucket for abtest data [extensions/VisualEditor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759315 (https://phabricator.wikimedia.org/T291308) [18:08:46] (03PS1) 10DLynch: New bucket for abtest data [extensions/WikiEditor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759316 (https://phabricator.wikimedia.org/T291308) [18:09:44] I killed the php7.2-fpm processes on mediawiki12. There are php7.4-fpm processes that still exist (but aren't spinning) [18:10:02] !log killed 8 spinning php7.2-fpm processes on mediawiki12 [18:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:32] It seems like routing to en.wikipedia.beta.wmflabs.org is the outstanding problem. [18:12:08] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10Ottomata) Approved! 
[18:13:45] (03PS3) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 2022.01.31.00 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 [18:13:47] (03PS3) 10Arturo Borrero Gonzalez: gitignore: ignore additional debian artifacts [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759554 [18:14:00] (03CR) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 2022.01.31.00 (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 (owner: 10Arturo Borrero Gonzalez) [18:15:13] cripes [18:17:31] !log restarted php7.2-fpm processes on mediawiki12 [18:17:33] (03PS1) 104nn1l2: commonswiki: Add three domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759557 (https://phabricator.wikimedia.org/T299835) [18:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:32] https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page is responding now [18:21:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P20129 and previous config saved to /var/cache/conftool/dbconfig/20220203-182129-marostegui.json [18:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:11] (03CR) 10Bking: [C: 03+2] Upgrade to elasticsearch 6.8.23 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) (owner: 10EJoseph) [18:26:42] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/759549 (owner: 10Ayounsi) [18:29:24] (03PS1) 10Majavah: hieradata: deployment-prep: re-add mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/759559 [18:30:01] 10SRE, 10Wikimedia-Mailing-lists: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729 (10jcrespo) Here is some links with some discussion about doing that on mailman3 (it has some gotchas): 
https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/... [18:30:03] (03CR) 10Ahmon Dancy: [C: 03+1] hieradata: deployment-prep: re-add mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/759559 (owner: 10Majavah) [18:31:44] (03CR) 10Dduvall: beta: Discover etcd servers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [18:32:36] 10SRE, 10Commons, 10Tools, 10Wikimedia-Mailing-lists: Character encoding issues on daily-image-l - https://phabricator.wikimedia.org/T295096 (10jcrespo) T282621 doesn't seem to have fixed this, it is still happening: https://lists.wikimedia.org/hyperkitty/list/daily-image-l@lists.wikimedia.org/thread/UQBNU... [18:33:19] (03CR) 10Majavah: "What are you planning on using this on? There isn't really any data stored on deployment-etcd*." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [18:34:14] (03PS4) 10Herron: watchrat: add http probe alerting with warning severity [alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) [18:34:37] (03CR) 10Dduvall: beta: Discover etcd servers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [18:34:49] (03CR) 10Jbond: [C: 03+2] hieradata: deployment-prep: re-add mediawiki11 [puppet] - 10https://gerrit.wikimedia.org/r/759559 (owner: 10Majavah) [18:35:05] (03CR) 10Herron: "Thanks for the feedback!" 
[alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [18:36:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300402)', diff saved to https://phabricator.wikimedia.org/P20130 and previous config saved to /var/cache/conftool/dbconfig/20220203-183634-marostegui.json [18:36:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [18:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [18:36:39] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [18:36:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T300402)', diff saved to https://phabricator.wikimedia.org/P20131 and previous config saved to /var/cache/conftool/dbconfig/20220203-183648-marostegui.json [18:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:14] (03PS1) 10Clare Ming: Update icons, wordmark for test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759560 
(https://phabricator.wikimedia.org/T299512) [18:37:48] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10jcrespo) [18:37:56] (03CR) 10Faidon Liambotis: "mirrors.wikimedia.org is a sync mirror (and 1/3rd of the ftp.us.debian.org rotation) so it should not be missing any packages. If it does," [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759552 (owner: 10Arturo Borrero Gonzalez) [18:42:46] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:09] (03PS3) 10Michael DiPietro: mcrouter for bullseye. This does some unexpected things, in particular build.sh doesn't seem to work, errors out with a seg fault. README updated with instructions on how to make it work. [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759539 (https://phabricator.wikimedia.org/T300578) [18:57:25] (03CR) 10Clare Ming: [C: 03+1] Update skin checks with new vector skin key. [extensions/ContentTranslation] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759314 (https://phabricator.wikimedia.org/T298916) (owner: 10Jdlrobson) [18:58:21] (03PS1) 104nn1l2: Remove redundant patrolmarks flag from patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759564 (https://phabricator.wikimedia.org/T300913) [19:00:05] RoanKattouw and Urbanecm: Time to snap out of that daydream and deploy UTC evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T1900). [19:00:05] Jdlrobson, kemayo, and nn1l2: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:10] hi [19:00:14] i can deploy today! [19:00:27] * taavi around too [19:00:41] 👍🏻 [19:00:44] taavi: do you want to lead the window to practice deployment? :)) [19:00:56] sure :P [19:01:01] (03CR) 10Urbanecm: [C: 03+2] New bucket for abtest data [extensions/VisualEditor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759315 (https://phabricator.wikimedia.org/T291308) (owner: 10DLynch) [19:01:05] (03CR) 10Urbanecm: [C: 03+2] New bucket for abtest data [extensions/WikiEditor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759316 (https://phabricator.wikimedia.org/T291308) (owner: 10DLynch) [19:01:05] wow that's a lot of backports [19:01:12] Jdlrobson: hey, around [19:01:13] (03PS2) 10Jdlrobson: Drop skin override [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) [19:01:14] ? [19:01:21] (03Abandoned) 10Jdlrobson: Drop skin override [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) (owner: 10Jdlrobson) [19:01:22] taavi: go ahead then :) [19:01:33] I'm around if you need me. [19:01:37] cool [19:01:45] hello [19:01:46] Kemayo: does the order of your patches matter? [19:01:53] It does not. [19:01:58] ok [19:02:12] Jdlrobson: same question to you, does the order of your patches matter? [19:02:33] taavi: fyi i +2'ed Kemayo's backports already to save time on CI (feel free to keep the +2 or remove, up2you). [19:02:40] yeah I saw, thx [19:02:41] hi taavi yes order matters [19:02:57] Pass skin name to Hooks::isSkinLegacy first [19:03:21] Drop skin override last (also having a few issues with Jenkins with that one so we can leave that until later in the window) [19:03:57] "Pass skin name to Hooks::isSkinLegacy" links to https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/759283, which has a different commit message? 
[19:04:12] (03Restored) 10Jdlrobson: Drop skin override [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) (owner: 10Jdlrobson) [19:04:45] nn1l2: let's do your config patch while the backports are in CI [19:04:51] ok [19:06:06] (03CR) 10Jdlrobson: [C: 04-1] "Not sure what's going on here." [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) (owner: 10Jdlrobson) [19:06:37] urbanecm: do you know about s3 urls? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/759557 adds one to the commons upload-by-url allowlist, I'm wondering if that subdomain only used for that particular s3 customer or if s3 customers share subdomains? [19:07:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:07:44] taavi: we allowlisted several s3 domains in the past, so I think it should be fine. AFAIK, it's only used by the particular customer. 
[19:07:55] great [19:07:55] taavi: We can backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/759314 and https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/759310 together [19:07:58] (03CR) 10Majavah: [C: 03+2] commonswiki: Add three domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759557 (https://phabricator.wikimedia.org/T299835) (owner: 104nn1l2) [19:08:16] (those should be done first) [19:08:41] (03Merged) 10jenkins-bot: commonswiki: Add three domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759557 (https://phabricator.wikimedia.org/T299835) (owner: 104nn1l2) [19:09:10] nn1l2: your patch is available for testing on mwdebug1001, please test [19:09:11] (03PS3) 10Jdlrobson: Drop skin override [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) [19:09:18] ok [19:10:14] Jdlrobson: ok, does it matter which one of those patches goes first? [19:10:28] taavi: no. 
Those two can go out at the same time [19:10:33] or any order [19:11:01] (03CR) 10Majavah: [C: 03+2] "deploying" [extensions/ContentTranslation] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759314 (https://phabricator.wikimedia.org/T298916) (owner: 10Jdlrobson) [19:11:07] (03CR) 10Majavah: [C: 03+2] "deploying" [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759310 (https://phabricator.wikimedia.org/T299971) (owner: 10Jdlrobson) [19:11:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:12:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:52] of the three, two work and one fails [19:13:01] what should we do? [19:13:22] Please see https://phabricator.wikimedia.org/T299835#7674553 [19:13:42] can you upload a patch which removes the failing one?
and then I'll sync both out at once [19:14:15] The user wanted to add original urls too, but they don't work as they are merely redirects, I think [19:14:23] OK [19:15:35] (03Merged) 10jenkins-bot: New bucket for abtest data [extensions/VisualEditor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759315 (https://phabricator.wikimedia.org/T291308) (owner: 10DLynch) [19:16:08] (03Merged) 10jenkins-bot: New bucket for abtest data [extensions/WikiEditor] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759316 (https://phabricator.wikimedia.org/T291308) (owner: 10DLynch) [19:16:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:48] Kemayo: hi, both of your patches are available for testing on mwdebug1002 [19:18:03] I'll take a look [19:19:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10Cmjohnson) [19:19:46] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Cmjohnson) [19:21:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:42] 10ops-eqiad, 10DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T300820 (10Cmjohnson) @bblack most likely need to swap out the sfp+ for this link. Do you need to downtime anything for me to do this? [19:23:23] (03PS1) 104nn1l2: commonswiki: Remove images.collections.yale.edu from the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759569 (https://phabricator.wikimedia.org/T299835) [19:23:45] taavi: Okay, looks good. 
Sorry it took me a minute -- I forgot .20 wasn't out to all wikipedias yet so I was checking the wrong wiki. [19:23:53] sorry for the delay: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/759569/ [19:23:55] (03CR) 10Majavah: [C: 03+2] commonswiki: Remove images.collections.yale.edu from the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759569 (https://phabricator.wikimedia.org/T299835) (owner: 104nn1l2) [19:24:11] Kemayo: great! I'll sync them out after I've synced nn1l2's patches [19:25:16] (03Merged) 10jenkins-bot: commonswiki: Remove images.collections.yale.edu from the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759569 (https://phabricator.wikimedia.org/T299835) (owner: 104nn1l2) [19:25:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:25:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:39] (03Merged) 10jenkins-bot: Update skin checks with new vector skin key. 
[extensions/ContentTranslation] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759314 (https://phabricator.wikimedia.org/T298916) (owner: 10Jdlrobson) [19:25:53] (03Merged) 10jenkins-bot: Pass skin name to Hooks::isSkinLegacy [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759310 (https://phabricator.wikimedia.org/T299971) (owner: 10Jdlrobson) [19:26:23] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:759557|commonswiki: Add three domains to the wgCopyUploadsDomains allowlist (T299835 T300848)]] (duration: 00m 54s) [19:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:28] T299835: Add images.collections.yale.edu to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T299835 [19:26:29] T300848: Add oxalis.br.fgov.be to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300848 [19:26:38] Kemayo: ok, syncing your patches now [19:26:40] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:26:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:27:30] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.trackSubscriber.js: Backport: [[gerrit:759315|New bucket for abtest data (T291308)]] (duration: 00m 50s) [19:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:27:34] T291308: Make config change to start New Discussion Tool A/B Test - https://phabricator.wikimedia.org/T291308 [19:28:29] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/WikiEditor/includes/Hooks.php: Backport: [[gerrit:759316|New bucket for abtest data (T291308)]] (1/2) (duration: 00m 49s) [19:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:23] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/WikiEditor/modules/ext.wikiEditor.js: Backport: [[gerrit:759316|New bucket for abtest data (T291308)]] (2/2) (duration: 00m 50s) [19:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:32] Jdlrobson: hey, your patches are available for testing on mwdebug1001 [19:30:48] taavi: yey. Thanks! looking.... [19:31:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300402)', diff saved to https://phabricator.wikimedia.org/P20132 and previous config saved to /var/cache/conftool/dbconfig/20220203-193208-marostegui.json [19:32:11] taavi: LGTM. Please sync! 
[19:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:12] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [19:32:15] sure [19:33:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:33:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster [19:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:15] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/ContentTranslation/modules/entrypoints/ext.cx.entrypoints.contributionsmenu.js: Backport: [[gerrit:759314|Update skin checks with new vector skin key. 
(T298916 T300814)]] (duration: 00m 50s) [19:33:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster [19:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:21] T298916: Any code that checks getSkinName for vector must now also check vector-2022 - https://phabricator.wikimedia.org/T298916 [19:33:21] T300814: Live preview doesn't work with Vector 2022 skin - https://phabricator.wikimedia.org/T300814 [19:34:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:34:22] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.20/skins/Vector/includes/Hooks.php: Backport: [[gerrit:759310|Pass skin name to Hooks::isSkinLegacy (T299971)]] (duration: 00m 49s) [19:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:27] T299971: [subtask] Problem in Legacy Vector calculation ([{reqId}] {exception_url} PHP Notice: Undefined index: data-user-page ) - https://phabricator.wikimedia.org/T299971 [19:34:30] Jdlrobson: both of them are now live! [19:34:34] anything else? 
[19:35:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster [19:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster [19:37:09] taavi: yep last patch.. [19:37:31] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/759313 [19:39:10] the latest PS of that one hasn't been +2'd on master yet? [19:39:17] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup1004.eqiad.wmnet with OS buster [19:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster... 
[19:39:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster [19:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster [19:40:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:40:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:02] taavi: looks like it was +2ed twice but didn't go through [19:41:06] (03PS2) 104nn1l2: Remove redundant patrolmarks flag from patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759564 (https://phabricator.wikimedia.org/T300913) [19:41:23] Gimme a few mins to get my reviewer to merge it again [19:41:29] this one's pretty urgent to go out now [19:41:36] Jdlrobson: no, patch set 3 was +2'd and then you uploaded a new, modified patch set [19:41:37] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup1003.eqiad.wmnet with OS buster [19:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - 
https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster... [19:41:43] sure [19:41:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:41:50] taavi: i know miscommunication with Clare [19:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:01] chatting to her now [19:42:13] done [19:42:54] (we were pairing on this one and got our wires crossed) [19:43:04] (03CR) 10Majavah: [C: 03+2] "deploying" [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) (owner: 10Jdlrobson) [19:43:10] thanks! [19:43:17] thanks taavi and sorry about all that confusion [19:43:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We can use gerrit as well without a .gitreview file, so please correct the commit message to say "to ease the use of git-review"." [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759551 (owner: 10Arturo Borrero Gonzalez) [19:44:04] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "We're not building this package on our own laptop, and I don't see the reason for this switch. Quite the contrary, actually." 
[debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759552 (owner: 10Arturo Borrero Gonzalez) [19:45:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster [19:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster [19:47:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P20133 and previous config saved to /var/cache/conftool/dbconfig/20220203-194712-marostegui.json [19:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:03] dancy, brennen: apologies in advance if the backport window finishes a few minutes into the train window [19:51:16] no prob [19:51:22] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall LGTM; but I would suggest we first stick to the same version we're using on buster, which is the last that had a proper release pr" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) (owner: 10Arturo Borrero Gonzalez) [19:52:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] gitignore: ignore additional debian artifacts [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759554 (owner: 10Arturo Borrero Gonzalez) [19:53:41] dancy: thanks :) [19:58:26] akosiaris: hi, would you be able to review a puppet patch adding a new periodic_job? Thanks in any case [19:59:56] hauskatze: it's late at night for him but I can take a look :) [20:00:05] dancy and brennen: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220203T2000). [20:00:22] of course jenkins is taking ages with our last backport [20:00:40] rzl: oh, didn't know; hope I didn't wake him up :) [20:00:56] taavi: :( [20:01:09] (03PS1) 10Ryan Kemper: Revert "Revert "elasticsearch: activate role (step 2)"" [puppet] - 10https://gerrit.wikimedia.org/r/759317 [20:01:21] I wish Jenkins could prioritize backport patches [20:01:25] (around for when backport finishes) [20:01:26] (03PS15) 10MarcoAurelio: mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) [20:01:29] e.g. pause everything else [20:01:38] rzl: it'd be: https://gerrit.wikimedia.org/r/c/operations/puppet/+/756069 [20:01:48] (03Merged) 10jenkins-bot: Drop skin override [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759313 (https://phabricator.wikimedia.org/T300814) (owner: 10Jdlrobson) [20:01:49] jenkins does support priorities (to some extent), but all of gate-and-submit has the high(est) IIRC [20:01:51] Jdlrobson: it does start them before non-backports, but once they're running they're treated the same [20:02:09] or maybe gate-and-submit-wmf and gate-and-submit are equal [20:02:13] I have to leave for a moment; rzl - half an hour at most [20:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P20134 and previous config saved to /var/cache/conftool/dbconfig/20220203-200217-marostegui.json [20:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:25] finally [20:02:48] (03PS2) 10Ryan Kemper: elasticsearch: activate role (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/759317 (https://phabricator.wikimedia.org/T294805) [20:02:49] Jdlrobson: can you test on mwdebug1001?
[20:03:08] (03PS3) 10Dzahn: planet: add diff.wikimedia.org/feed to en.planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/759547 (https://phabricator.wikimedia.org/T230444) [20:03:10] (03PS1) 10Ladsgroup: Add change_ar_timestamp_T298554.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759578 (https://phabricator.wikimedia.org/T298554) [20:03:35] (03CR) 10jerkins-bot: [V: 04-1] planet: add diff.wikimedia.org/feed to en.planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/759547 (https://phabricator.wikimedia.org/T230444) (owner: 10Dzahn) [20:03:40] taavi: on it [20:03:45] (03PS3) 10Ryan Kemper: elasticsearch: activate role (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/759317 (https://phabricator.wikimedia.org/T294805) [20:05:02] just noting here that I'll sync skin.json first and hooks.php second, that way we won't have fatals in the middle [20:05:05] (03PS4) 10Dzahn: planet: add diff.wikimedia.org/feed to en.planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/759547 (https://phabricator.wikimedia.org/T230444) [20:05:20] taavi: perfect. Please sync [20:05:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1004.eqiad.wmnet with OS buster [20:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:30] ack [20:05:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster... 
[20:05:39] (03CR) 10Dzahn: [C: 03+2] "turns out there was a ticket for this https://phabricator.wikimedia.org/T230444" [puppet] - 10https://gerrit.wikimedia.org/r/759547 (https://phabricator.wikimedia.org/T230444) (owner: 10Dzahn) [20:06:26] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.20/skins/Vector/skin.json: Backport: [[gerrit:759313|Drop skin override (T300814)]] (1/2) (duration: 00m 49s) [20:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:30] T300814: Live preview doesn't work with Vector 2022 skin - https://phabricator.wikimedia.org/T300814 [20:07:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:19] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.20/skins/Vector/includes/Hooks.php: Backport: [[gerrit:759313|Drop skin override (T300814)]] (2/2) (duration: 00m 49s) [20:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:30] deployed! [20:07:41] thanks taavi ! [20:07:44] dancy, brennen: over to you! sorry about the delay again [20:08:04] thx.. pressing buttons.. [20:08:07] (03CR) 10Bking: [V: 03+1] "Bring 'em on!" 
[puppet] - 10https://gerrit.wikimedia.org/r/759317 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [20:08:08] * Jdlrobson prays he doesn't need to use the backport window later today for anything [20:08:13] (03PS1) 10Kosta Harlan: Correct attribute for flow thanks [extensions/Thanks] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759319 (https://phabricator.wikimedia.org/T300831) [20:08:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:08:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:38] !log planet1002/planet2002 - sudo systemctl start planet-update-en to manually start update after adding diff.wikimedia.org T230444 [20:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:42] T230444: Add Diff to Planet Wikimedia - https://phabricator.wikimedia.org/T230444 [20:08:47] (03PS1) 10Ahmon Dancy: group2 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759580 [20:08:49] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759580 (owner: 10Ahmon Dancy) [20:09:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:09:24] (03CR) 10RLazarus: [C: 03+1] mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: 10MarcoAurelio) [20:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:14] (03CR) 10DLynch: [C: 03+1] Correct attribute for flow thanks [extensions/Thanks] (wmf/1.38.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/759319 (https://phabricator.wikimedia.org/T300831) (owner: 10Kosta Harlan) [20:10:17] hauskatze: lgtm, +1ed -- let me know if you're ready for me to merge it [20:10:20] (03Merged) 10jenkins-bot: group2 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759580 (owner: 10Ahmon Dancy) [20:10:58] taavi, my last patch should be rescheduled? [20:11:31] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.38.0-wmf.20 refs T293961 [20:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:34] T293961: 1.38.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T293961 [20:11:45] nn1l2: if you're talking about the patrolmarks on, then yes please, sadly we didn't have enough time in this window [20:12:02] OK, thanks! [20:12:47] wmf.20 is at group2. Feel free to perform additional backports now if needed [20:13:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1003.eqiad.wmnet with OS buster [20:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster... [20:14:09] someone else can if needed.. 
I need to step away [20:14:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:29] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: activate role (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/759317 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [20:15:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:15:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:17] !log T294805 Disabled puppet on `elastic1*` in preparation for brining new hosts into service: `ryankemper@cumin1001:~$ sudo cumin 'elastic1*' 'sudo disable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805"'` [20:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:21] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [20:17:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300402)', diff saved to https://phabricator.wikimedia.org/P20135 and previous config saved to /var/cache/conftool/dbconfig/20220203-201721-marostegui.json [20:17:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [20:17:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance 
[20:17:25] !log T294805 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/759317 to activate roles for elastic eqiad replacement hosts [20:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:26] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [20:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T300402)', diff saved to https://phabricator.wikimedia.org/P20136 and previous config saved to /var/cache/conftool/dbconfig/20220203-201729-marostegui.json [20:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T300402)', diff saved to https://phabricator.wikimedia.org/P20137 and previous config saved to /var/cache/conftool/dbconfig/20220203-201836-marostegui.json [20:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:56] !log T294805 Running puppet on single elastic host: `ryankemper@elastic1068:~$ sudo run-puppet-agent --force` [20:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:01] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [20:25:21] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mx1001.wikimedia.org with reason: systemd testing [20:25:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx1001.wikimedia.org with reason: systemd testing [20:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:14] !log T294805 Running puppet on `elastic1068` failed, looks like `/usr/share/elasticsearch/lib' wasn't there: https://phabricator.wikimedia.org/P20138 [20:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:46] !log T294805 Running puppet on `elastic1068` failed, looks like `/usr/share/elasticsearch/lib` wasn't there: https://phabricator.wikimedia.org/P20138 [20:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:54] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:31:00] PROBLEM - Check systemd state on elastic1078 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:36] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:33:22] rzl: I'm back. So with taavi 's testing of it on beta cluster and the +1 from you and legoktm I think we can merge [20:33:38] IF you think that's okay [20:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P20139 and previous config saved to /var/cache/conftool/dbconfig/20220203-203341-marostegui.json [20:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:57] yes please :) [20:34:17] I ran PCC for both datacenters just in case [20:34:31] oh hi legoktm :) [20:34:54] hauskatze: sounds good! 
going ahead [20:34:56] hi legoktm :) [20:35:06] (03CR) 10RLazarus: [C: 03+2] mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: 10MarcoAurelio) [20:35:11] Hi :D [20:35:53] Wrong category counts no more! (at least once a month) [20:38:26] and the patch shows correctly co-authored too in github [20:38:57] systemctl output on mwmaint1002 looks right 👍 [20:38:59] rzl: it would be nice if we/you could start the systemd units now since we just passed the 1st [20:39:09] yeah was just about to ask [20:40:07] hauskatze: does that sound okay to you too? [20:42:06] rzl: sure - would it be possible to see or dump the log of the first run in production to ensure it ran as expected? [20:42:26] can do [20:42:30] I'd be curious to know how many broken category counts and empty cats get fixed [20:42:47] I'd appreciate it if you could rzl :) [20:43:10] !log rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_recount_categories.service # T299823 [20:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:14] T299823: Regularly run recountCategories.php on Wikimedia wikis via systemd timer - https://phabricator.wikimedia.org/T299823 [20:43:59] churning through arwiki now, I'll post logs to that task when it's finished [20:44:11] awesome [20:44:21] I'll shave in the meantime :) [20:45:29] (03CR) 10JHathaway: ferm: replace systemd unit to ensure success on boot (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758548 (owner: 10JHathaway) [20:48:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P20140 and previous config saved to /var/cache/conftool/dbconfig/20220203-204846-marostegui.json [20:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:34] (03PS1) 10Ryan Kemper: elastic: make 
wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) [20:50:09] (03CR) 10Bking: [C: 03+1] elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [20:50:12] (03CR) 10jerkins-bot: [V: 04-1] elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [20:51:42] (03PS2) 10Ryan Kemper: elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) [20:52:29] (03CR) 10jerkins-bot: [V: 04-1] elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [20:54:04] (03PS1) 10Andrew Bogott: Add some comments to the rules for backup traffic to eqiad ceph [homer/public] - 10https://gerrit.wikimedia.org/r/759590 [20:56:44] (03CR) 10Andrew Bogott: "not sure if this is where you needed these but you can copy/paste as appropriate" [homer/public] - 10https://gerrit.wikimedia.org/r/759590 (owner: 10Andrew Bogott) [20:57:39] (03PS4) 10JHathaway: mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) [21:00:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=citoid.svc.eqiad.wmnet, port=4003): Read timed out. 
(read timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Citoid [21:01:29] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:03:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T300402)', diff saved to https://phabricator.wikimedia.org/P20142 and previous config saved to /var/cache/conftool/dbconfig/20220203-210350-marostegui.json [21:03:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [21:03:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [21:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:57] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [21:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T300402)', diff saved to https://phabricator.wikimedia.org/P20143 and previous config saved to /var/cache/conftool/dbconfig/20220203-210358-marostegui.json [21:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T300402)', diff saved to https://phabricator.wikimedia.org/P20144 and previous config saved to /var/cache/conftool/dbconfig/20220203-210607-marostegui.json [21:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:54] (03CR) 10JHathaway: mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) (owner: 10JHathaway) [21:13:43] 
PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:28] (03PS3) 10Ryan Kemper: elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) [21:15:04] (03CR) 10jerkins-bot: [V: 04-1] elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:15:35] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33574/console" [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:17:24] (03PS4) 10Ryan Kemper: elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) [21:20:24] (03CR) 10Ryan Kemper: [C: 03+2] elastic: make wmf-es-search-plugins req es package [puppet] - 10https://gerrit.wikimedia.org/r/759588 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:21:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P20145 and previous config saved to /var/cache/conftool/dbconfig/20220203-212111-marostegui.json [21:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:30] !log T294805 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/759588; hoping this resolves dependency issues. 
Running puppet agent on `elastic1068` [21:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:34] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [21:25:54] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus) >>! In T300324#7673609, @RLazarus wrote: > reprepro has only has 1.15.4 in wikimedia-stretch, compared to 1.15.5 in buster and bullseye. Correction: 1.15.5 is only in buster... [21:27:42] !log root@apt1001:/home/rzl# reprepro copy stretch-wikimedia buster-wikimedia envoyproxy # T300324 [21:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:46] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [21:27:59] !log root@apt1001:/home/rzl# reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy # T300324 [21:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P20147 and previous config saved to /var/cache/conftool/dbconfig/20220203-213616-marostegui.json [21:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:56] (03CR) 10JHathaway: [C: 03+1] "looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [21:49:54] (03PS1) 10Ryan Kemper: elastic: es pkg needs 3rd party comp [puppet] - 10https://gerrit.wikimedia.org/r/759617 (https://phabricator.wikimedia.org/T294805) [21:51:07] (03CR) 10jerkins-bot: [V: 04-1] elastic: es pkg needs 3rd party comp [puppet] - 10https://gerrit.wikimedia.org/r/759617 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:51:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T300402)', diff saved to https://phabricator.wikimedia.org/P20148 and previous config saved to /var/cache/conftool/dbconfig/20220203-215121-marostegui.json [21:51:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [21:51:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [21:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:26] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [21:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [21:51:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [21:51:34] (03PS2) 10Ryan Kemper: elastic: es pkg needs 3rd party comp [puppet] - 10https://gerrit.wikimedia.org/r/759617 (https://phabricator.wikimedia.org/T294805) [21:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 
12:00:00 on 12 hosts with reason: Maintenance [21:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [21:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [21:51:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [21:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T300402)', diff saved to https://phabricator.wikimedia.org/P20149 and previous config saved to /var/cache/conftool/dbconfig/20220203-215154-marostegui.json [21:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:18] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33576/console" [puppet] - 10https://gerrit.wikimedia.org/r/759617 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:52:23] (03CR) 10Bking: [C: 03+1] elastic: es pkg needs 3rd party comp [puppet] - 10https://gerrit.wikimedia.org/r/759617 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300402)', diff saved to https://phabricator.wikimedia.org/P20150 and previous config saved to /var/cache/conftool/dbconfig/20220203-215402-marostegui.json [21:54:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:12] how's the script doing rzl - still running? :) [21:54:19] (03PS3) 10Ryan Kemper: elastic: es pkg needs 3rd party comp [puppet] - 10https://gerrit.wikimedia.org/r/759617 (https://phabricator.wikimedia.org/T294805) [21:55:38] (03CR) 10Ryan Kemper: [C: 03+2] elastic: es pkg needs 3rd party comp [puppet] - 10https://gerrit.wikimedia.org/r/759617 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [21:59:03] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1068-production-search-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:04:03] !log volans@cumin2002 START - Cookbook sre.dns.netbox [22:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:51] (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759560 (https://phabricator.wikimedia.org/T299512) (owner: 10Clare Ming) [22:09:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P20152 and previous config saved to /var/cache/conftool/dbconfig/20220203-220906-marostegui.json [22:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:27] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:38] !log T294805 https://gerrit.wikimedia.org/r/c/operations/puppet/+/759617/ fixed the dependency issues, going to start bringing new hosts into service [22:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:42] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [22:14:03] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch 
instance elastic1068-production-search-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:16:59] ^ Above alert is likely a result of shard reallocation / shuffling occurring as we introduce new hosts (starting with `elastic1068`) to the fleet; will keep an eye on things but the alert can be disregarded for the timebeing [22:18:16] !log T294805 Bringing in new eqiad hosts in batches of 4, with 15-20 mins between batches: `ryankemper@cumin1001:~$ sudo -E cumin -b 4 'elastic1*' 'sudo run-puppet-agent --force; sudo run-puppet-agent; sleep 900'` tmux session `es_eqiad` [22:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:31] !log T294805 Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&refresh=1m&from=now-3h&to=now as new hosts join the fleet [22:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:33] ryankemper: cumin ssh as root so no need to sudo in the commands to execute, also why running puppet twice? 
[22:23:56] volans: ah thanks on the sudo tip [22:24:03] (CirrusSearchJVMGCOldPoolFlatlined) firing: (3) Elasticsearch instance elastic1068-production-search-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:24:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P20153 and previous config saved to /var/cache/conftool/dbconfig/20220203-222411-marostegui.json [22:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:51] volans: unfortunately running puppet twice because the logstash packages implicitly depend on the elasticsearch-oss package being installed, so the logstash stuff fails on the first puppet run [22:25:06] ryankemper: also, slightly related, it's always better to run puppet with the original comment with which you disabled it, otherwise you might end up enabling puppet on a host where it was already disabled with a different message.
I understand it makes no difference here as those are new hosts, but you know, habits become muscle memory ;) [22:25:25] ah, that seems to require a puppet patch to fix it [22:25:28] volans: ack, good point on that [22:25:42] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:04] volans: yeah, we need to dig in a bit more and open up a patch and then tag o11y on it [22:26:17] ack, thx for the context [22:26:28] (but for the timebeing we don't want to block the bringing-in of these new hosts while we get that side of things figured out) [22:27:52] sure sure [22:29:03] (CirrusSearchJVMGCOldPoolFlatlined) firing: (3) Elasticsearch instance elastic1068-production-search-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:33:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10wiki_willy) Hi @RKemper - just following up on this. With rows E and F coming online in a couple weeks (and rows A thru D being very tight on s... [22:34:03] (CirrusSearchJVMGCOldPoolFlatlined) resolved: (3) Elasticsearch instance elastic1068-production-search-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:36:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10RKemper) @RobH @wiki_willy Yes, rebalancing across E and F is fine by us. Would you like us to recompute the racking details or is it easier if... 
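[editor's note] The advice given at 22:25:06 (only re-enable puppet using the same message it was disabled with) can be illustrated with a small toy model. This is a sketch under stated assumptions, not the actual Wikimedia `disable-puppet` / `run-puppet-agent` tooling: the class `PuppetAgentLock` and its methods are hypothetical names invented for illustration, and the real scripts may behave differently in detail.

```python
from typing import Optional


class PuppetAgentLock:
    """Toy model (hypothetical) of a per-host puppet disable lock.

    The lock remembers the message it was disabled with and refuses to
    re-enable unless the caller supplies that same message.
    """

    def __init__(self) -> None:
        self.reason: Optional[str] = None  # None means puppet is enabled

    def disable(self, reason: str) -> None:
        # Assumption: if the host is already disabled, keep the first
        # reason so that the original owner's lock is preserved.
        if self.reason is None:
            self.reason = reason

    def enable(self, reason: str) -> bool:
        """Re-enable only if `reason` matches the stored message."""
        if self.reason is None:
            return True  # already enabled, nothing to do
        if self.reason != reason:
            return False  # someone else disabled it; leave it alone
        self.reason = None
        return True


if __name__ == "__main__":
    lock = PuppetAgentLock()
    lock.disable("db maintenance - someone else")
    # A later rollout tries to enable with its own message and is refused.
    print(lock.enable("Add new eqiad replacement hosts"))
    # The original owner can still re-enable with the matching message.
    print(lock.enable("db maintenance - someone else"))
```

The point of matching on the message is that a forced enable cannot tell "disabled by me" apart from "disabled by someone else for another reason"; comparing messages makes that distinction explicit.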
[22:37:32] RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:35] (03PS2) 10Clare Ming: Update icons, wordmark for test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759560 (https://phabricator.wikimedia.org/T299512) [22:39:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300402)', diff saved to https://phabricator.wikimedia.org/P20154 and previous config saved to /var/cache/conftool/dbconfig/20220203-223916-marostegui.json [22:39:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [22:39:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [22:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:21] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [22:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T300402)', diff saved to https://phabricator.wikimedia.org/P20155 and previous config saved to /var/cache/conftool/dbconfig/20220203-223923-marostegui.json [22:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10wiki_willy) Thanks @RKemper. If theres's a general criteria of no more than X number of servers spread across X number of rows, just let us kno... 
[22:41:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10wiki_willy) a:05RKemper→03Jclark-ctr [22:48:56] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:49:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T300402)', diff saved to https://phabricator.wikimedia.org/P20156 and previous config saved to /var/cache/conftool/dbconfig/20220203-224933-marostegui.json [22:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:38] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [22:54:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10RKemper) @wiki_willy @Jclark-ctr Since we're adding E and F to our list of allowable rows, we'll want to fill those new rows in as much as pos... [23:01:03] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Esanders) I still can't see thumbnails locally and on Patch demo. 
[23:03:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10wiki_willy) [23:04:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P20157 and previous config saved to /var/cache/conftool/dbconfig/20220203-230437-marostegui.json [23:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:24] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Esanders) 05Resolved→03Open [23:05:29] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Esanders) [23:07:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10wiki_willy) [23:07:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10wiki_willy) Thanks @RKemper, I appreciate it. I've gone ahead and updated the racking details in the task description to reflect that. Thanks,... 
[23:08:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10RKemper) [23:13:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (4) Elasticsearch instance elastic1070-production-search-omega-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [23:13:10] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [23:15:30] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:15:36] !log T294805 Added a silence on alerts.wikimedia.org for `CirrusSearchJVMGCOldPoolFlatlined` [23:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:41] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [23:19:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P20158 and previous config saved to /var/cache/conftool/dbconfig/20220203-231942-marostegui.json [23:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T300402)', diff saved to https://phabricator.wikimedia.org/P20159 and previous config saved to /var/cache/conftool/dbconfig/20220203-233447-marostegui.json [23:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:52] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [23:47:35] RECOVERY - Check systemd state on elastic1078 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:55] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook