[00:00:01] often then had follow-ups because it wasnt fully removed etc [00:00:04] brennen: #bothumor I οΏ½ Unicode. All rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211217T0000). [00:00:05] Juan_90264, Kemayo, and EricGardner: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:13] Present for today's backport deploy [00:00:15] I have the dump locally, so I'm just rsyncing it up and will move it in place [00:00:19] πŸ‘‹πŸ» [00:00:20] or "scp -3" via home but it's gonna be sloow [00:00:23] I'm present [00:00:29] ok [00:01:04] My patch has nothing testable by me, incidentally. So long as it doesn't cause any actual sudden surge of errors, it'll be good. [00:01:18] o/ [00:01:37] legoktm: I can move that to k8s miscweb mid-term (re: copied bz-static):) [00:01:45] :D [00:02:14] (03PS7) 10Juan90264: Fix wordmark to outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) [00:02:14] howdy folks, running the training this evening so as usual this will be a bit slower window than a normal one. [00:02:33] but I will still need the content in a repo that builds the image [00:02:37] did though for bz-static, all gzipped [00:02:41] (03PS3) 10Juan90264: Change logo in abwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747687 (https://phabricator.wikimedia.org/T297810) [00:02:46] barely made it but it worked [00:03:15] now I still need to configure the webserver to serve it uncompressed from pre-compressed .html.gz [00:03:46] the codereview dump (no gzip) is ~4G. the gzipped tarball I have is ~500M [00:03:47] trying to say.. let's upload it to git either way [00:03:54] hmm [00:04:06] gitlab test case ..jk.. 
unless [00:04:27] well, there are other k8s images that size [00:04:49] let's see next year :) [00:04:58] (which is soon) [00:05:38] brennen: ? [00:06:11] legoktm: that's how I did it for bz-static https://gerrit.wikimedia.org/r/c/operations/puppet/+/424657/2/modules/profile/manifests/microsites/static_bugzilla.pp [00:06:30] ah [00:06:40] Juan_90264: deploying shortly [00:06:56] legoktm: and here is where I put all the HTML content for k8s version https://gerrit.wikimedia.org/r/c/operations/container/miscweb/+/730292 [00:07:34] but I am not sure yet if it should be a separate container [00:07:39] just because of size [00:07:54] the tiny microsites would normally just all go into that same repo [00:07:55] brennen: Okay waiting... [00:08:05] (03CR) 10Brennen Bearnes: [C: 03+2] Fix wordmark to outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) (owner: 10Juan90264) [00:08:18] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (10Reedy) [00:08:47] (03Merged) 10jenkins-bot: Fix wordmark to outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) (owner: 10Juan90264) [00:09:41] Great merged! [00:11:57] brennen: mwdebug1001 or 1002 [00:12:00] ? [00:13:45] Juan_90264: mwdebug1002 [00:13:53] Okay [00:14:01] (03CR) 10Dzahn: "@miscweb2002:/var/log/apache2# tail -f static-tendril.*" [puppet] - 10https://gerrit.wikimedia.org/r/747665 (https://phabricator.wikimedia.org/T297605) (owner: 10Dzahn) [00:14:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:23] brennen: I tested and approved
[00:15:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:25] !log brennen@deploy1002 Synchronized static/images/mobile/copyright/outreach-wordmark.svg: Config: [[gerrit:746919|Fix wordmark to outreachwiki (T297580)]] (duration: 00m 57s)
[00:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:30] T297580: Outreach wiki uses Wikiquote logo in newer Desktop mode - https://phabricator.wikimedia.org/T297580
[00:17:05] Hm, I've realized that I'm not actually certain mine needs anything done for deployment. Skip me.
[00:17:50] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:746919|Fix wordmark to outreachwiki (T297580)]] (duration: 00m 57s)
[00:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:33] I've removed myself from the deployment page. I'll do some double-checking and see whether I actually need this, and add myself in on Monday if it's actually necessary.
[00:18:49] Kemayo: ack, thanks.
[00:18:58] Working on
[00:19:09] Juan_90264: looks correct to me.
[00:19:15] i've gone ahead and synced.
[00:19:35] EricGardner: moving on to your patch
[00:19:44] Ok. CI tends to take a while on MediaSearch
[00:19:57] Now it's "Change logo in abwiki"
[00:20:57] (CR) Brennen Bearnes: [C: +2] Don't boot users with title="Special:MediaSearch" back to old search page [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747909 (https://phabricator.wikimedia.org/T297877) (owner: Eric Gardner)
[00:24:30] (PS1) Dzahn: static-tendril: add "require all granted" to fix 403 Forbidden [puppet] - https://gerrit.wikimedia.org/r/747980 (https://phabricator.wikimedia.org/T297605)
[00:25:29] (CR) Brennen Bearnes: [C: +2] Change logo in abwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/747687 (https://phabricator.wikimedia.org/T297810) (owner: Juan90264)
[00:25:47] Juan_90264: will sync your abwiki change shortly
[00:26:04] Okay
[00:26:12] (Merged) jenkins-bot: Change logo in abwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/747687 (https://phabricator.wikimedia.org/T297810) (owner: Juan90264)
[00:26:16] (PS2) Dzahn: static-tendril: add "require all granted" to fix 403 Forbidden [puppet] - https://gerrit.wikimedia.org/r/747980 (https://phabricator.wikimedia.org/T297605)
[00:28:07] Juan_90264: please check mwdebug1002
[00:28:14] Okay
[00:28:27] (PS1) Clare Ming: Deploy sticky header to pilot wikis, launch A/B test. [mediawiki-config] - https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976)
[00:30:05] brennen: I tested and approved
[00:30:16] Juan_90264: thanks. syncing.
[00:31:48] !log brennen@deploy1002 Synchronized static/images/project-logos/abwiki-1.5x.png: Config: [[gerrit:747687|Change logo in abwiki (T297810)]] (duration: 00m 57s)
[00:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:53] T297810: Update logo for the Abkhazian Wikipedia - https://phabricator.wikimedia.org/T297810
[00:32:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:16] !log brennen@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:747687|Change logo in abwiki (T297810)]] (duration: 00m 57s)
[00:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:42] Does anyone know how I get the "If you break AND fix the wikis, you will be rewarded with a sticker" sticker? Fixed wordmark from https://outreach.wikimedia.org
[00:34:17] (CR) Clare Ming: "Note: this should not be merged until greenlit by Olga (product manager) - just prepping config" [mediawiki-config] - https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: Clare Ming)
[00:34:46] !log brennen@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:747687|Change logo in abwiki (T297810)]] (duration: 00m 56s)
[00:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:51] Does anyone know how I get the "If you break AND fix the wikis, you will be rewarded with a sticker" sticker? Fixed wordmark from outreach.wikimedia.org
[00:35:53] !log brennen@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:747687|Change logo in abwiki (T297810)]] (duration: 00m 57s)
[00:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:22] (PS3) Dzahn: static-tendril: add "require all granted" to fix 403 Forbidden [puppet] - https://gerrit.wikimedia.org/r/747980 (https://phabricator.wikimedia.org/T297605)
[00:37:22] ?
[00:38:37] Juan_90264: make a ticket to request the sticker and I'll ask the people who run the shop if they have any :)
[00:38:46] (Merged) jenkins-bot: Don't boot users with title="Special:MediaSearch" back to old search page [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747909 (https://phabricator.wikimedia.org/T297877) (owner: Eric Gardner)
[00:39:04] mutante: Okay
[00:39:37] Looks like CI is complete for the MediaSearch patch
[00:40:05] EricGardner: yep, syncing momentarily
[00:40:21] Once it's on one of the debug servers I can test
[00:41:03] (PS4) Dzahn: static-tendril: add "require all granted" to fix 403 Forbidden [puppet] - https://gerrit.wikimedia.org/r/747980 (https://phabricator.wikimedia.org/T297605)
[00:42:44] EricGardner: on mwdebug1002
[00:42:50] ok, testing now
[00:44:05] Looks good
[00:44:09] works as expected
[00:44:23] EricGardner: cool, syncing
[00:44:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:55] (CR) Dzahn: [C: +2] static-tendril: add "require all granted" to fix 403 Forbidden [puppet] - https://gerrit.wikimedia.org/r/747980 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:45:36] RECOVERY - Device not healthy -SMART- on ms-be2065 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2065&var-datasource=codfw+prometheus/ops
[00:45:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:45:55] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/MediaSearch/templates/SERPWidget.mustache: Backport: [[gerrit:747909|Don't boot users with title="Special:MediaSearch" back to old search page (T297877)]] (duration: 00m 57s)
[00:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:00] T297877: MediaSearch tabs go to Special:Search when javascript is off - https://phabricator.wikimedia.org/T297877
[00:47:04] (CR) Dzahn: "[cumin1001:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb.yaml --hosts miscweb2002.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/747980 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:47:26] brennen: now seeing correct behavior on prod. Thanks!
[00:47:33] (CR) Dzahn: "[cumin1001:~] $ curl -H "Host: dbtree.wikimedia.org" https://miscweb2002.codfw.wmnet | grep still" [puppet] - https://gerrit.wikimedia.org/r/747980 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:48:38] EricGardner: cool, sure thing.
[00:48:57] !log end of UTC late backport and config window
[00:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:31] (PS1) Dzahn: Revert "Revert "httpbb: add test for static-tendril to test_miscweb"" [puppet] - https://gerrit.wikimedia.org/r/747910
[00:53:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:58] (PS2) Dzahn: Revert "Revert "httpbb: add test for static-tendril to test_miscweb"" [puppet] - https://gerrit.wikimedia.org/r/747910
[00:55:07] (CR) Dzahn: [C: +2] Revert "Revert "httpbb: add test for static-tendril to test_miscweb"" [puppet] - https://gerrit.wikimedia.org/r/747910 (owner: Dzahn)
[01:08:03] (CR) Dzahn: "not working yet and aware of it, needs to be added to a TLS cert, doing that tomorrow" [puppet] - https://gerrit.wikimedia.org/r/747910 (owner: Dzahn)
[01:11:22] (CR) Cwhite: [C: +1] hieradata: add more network probes for internal services [puppet] - https://gerrit.wikimedia.org/r/747805 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[01:13:28] (CR) Cwhite: [C: +1] prometheus: extend blackbox probes options [puppet] - https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[01:14:38] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: Herron)
[01:27:08] (PS1) Legoktm: [WIP] Add siteinfo data in formatversion=2 too [dumps] - https://gerrit.wikimedia.org/r/747987
[01:27:36] (CR) jerkins-bot: [V: -1] [WIP] Add siteinfo data in formatversion=2 too [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[01:29:57] (CR) Legoktm: "The main motivation for this is https://gitlab.com/mwbot-rs/mwbot/-/merge_requests/25" [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[01:42:32] SRE-Access-Requests, Product-Analytics: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T297927 (cchen)
[01:42:52] SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (cchen)
[02:07:04] !log depooling wtp1025 for benchmarking (T297259)
[02:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:10] T297259: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259
[02:10:34] (PS1) Legoktm: Revert "Pretend mw1456 is a parsoid appserver for benchmarking" [puppet] - https://gerrit.wikimedia.org/r/747911
[02:13:08] (CR) Legoktm: [C: +2] Revert "Pretend mw1456 is a parsoid appserver for benchmarking" [puppet] - https://gerrit.wikimedia.org/r/747911 (owner: Legoktm)
[02:15:26] !log repooling mw1456
[02:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:13] PROBLEM - Device not healthy -SMART- on ms-be2065 is CRITICAL: cluster=swift device=sat+megaraid,14 instance=ms-be2065 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2065&var-datasource=codfw+prometheus/ops
[03:02:53] (PS1) Sharvaniharan: Add event stream config for android.customize_toolbar_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747991
[03:03:46] (PS2) Sharvaniharan: Add event stream config for android.customize_toolbar_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747991 (https://phabricator.wikimedia.org/T297818)
[03:03:58] (PS1) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912
[03:04:08] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (owner: Winston Sung)
[03:05:53] (CR) Sharvaniharan: "Hi @Ottomata @Jason Linehan, Please review the new changes to the config for android schema android.customize_toolbar_interaction. Thank y" [mediawiki-config] - https://gerrit.wikimedia.org/r/747991 (https://phabricator.wikimedia.org/T297818) (owner: Sharvaniharan)
[03:12:38] (PS2) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[03:12:49] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[03:13:05] (PS3) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[03:13:15] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[03:13:58] (PS1) Sharvaniharan: Add event stream config for ios.notification_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747993
[03:15:08] (PS2) Sharvaniharan: Add event stream config for ios.notification_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747993 (https://phabricator.wikimedia.org/T290920)
[03:16:00] (CR) Sharvaniharan: "Hi @Ottomata @Jason Linehan please review these changes for ios.notification_interaction. Thank you." [mediawiki-config] - https://gerrit.wikimedia.org/r/747993 (https://phabricator.wikimedia.org/T290920) (owner: Sharvaniharan)
[03:20:32] (PS4) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[03:20:42] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[03:29:00] (PS1) Andrew Bogott: Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004 [puppet] - https://gerrit.wikimedia.org/r/747994 (https://phabricator.wikimedia.org/T297814)
[03:29:30] (PS1) Andrew Bogott: make cloudmetrics1004 the primary cloudmetrics endpoint [dns] - https://gerrit.wikimedia.org/r/747995 (https://phabricator.wikimedia.org/T297814)
[03:30:07] (CR) Andrew Bogott: [C: +2] Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004 [puppet] - https://gerrit.wikimedia.org/r/747994 (https://phabricator.wikimedia.org/T297814) (owner: Andrew Bogott)
[03:30:37] (CR) Andrew Bogott: [C: +2] make cloudmetrics1004 the primary cloudmetrics endpoint [dns] - https://gerrit.wikimedia.org/r/747995 (https://phabricator.wikimedia.org/T297814) (owner: Andrew Bogott)
[03:30:41] (PS2) Andrew Bogott: make cloudmetrics1004 the primary cloudmetrics endpoint [dns] - https://gerrit.wikimedia.org/r/747995 (https://phabricator.wikimedia.org/T297814)
[03:35:43] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:37:55] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 31, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:50:04] (PS1) Winston Sung: Merge branch 'master' of f3d235d into gerrit 747912 [mediawiki-config] - https://gerrit.wikimedia.org/r/747997
[03:50:16] (CR) jerkins-bot: [V: -1] Merge branch 'master' of f3d235d into gerrit 747912 [mediawiki-config] - https://gerrit.wikimedia.org/r/747997 (owner: Winston Sung)
[03:54:03] (PS5) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[03:54:12] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[03:56:27] (PS6) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[04:00:14] (PS7) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[04:00:23] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:01:29] (Abandoned) Winston Sung: Merge branch 'master' of f3d235d into gerrit 747912 [mediawiki-config] - https://gerrit.wikimedia.org/r/747997 (owner: Winston Sung)
[04:02:40] (PS8) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[04:02:50] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:03:47] (PS9) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[04:03:56] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:04:23] (PS10) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[04:04:33] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:08:57] (PS11) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593)
[04:09:07] (CR) jerkins-bot: [V: -1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:11:48] (Abandoned) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:29:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:09] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[04:32:21] (CR) Winston Sung: [C: +1] Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:33:17] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[04:34:01] (Restored) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (https://phabricator.wikimedia.org/T165593) (owner: Winston Sung)
[04:34:09] (PS12) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912
[04:34:15] (Abandoned) Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" for Ib083a8ff042daa9bdd30d6a1e8c34f85b500fc12 [mediawiki-config] - https://gerrit.wikimedia.org/r/747912 (owner: Winston Sung)
[04:35:01] RECOVERY - Device not healthy -SMART- on ms-be2065 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2065&var-datasource=codfw+prometheus/ops
[05:04:37] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:15] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RKemper) Brian and I will pair tomorrow on making the various puppet patches for the access request. (Also we should have approval from gehel tomorrow)
[06:06:55] PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:49] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:24:30] (CR) ArielGlenn: "You might be better off to add a second class SiteInfoV2Dump derived from the existing one, and just override the build_commsnd method in " [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[06:26:47] (CR) ArielGlenn: "The problem here is that the only days we can do maintenance on the dumpsdata or snapshot hosts involved in the runs of all the dumps run " [puppet] - https://gerrit.wikimedia.org/r/747879 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[06:38:24] (PS2) Legoktm: Add siteinfo data in formatversion=2 too [dumps] - https://gerrit.wikimedia.org/r/747987
[06:38:53] (CR) Legoktm: Add siteinfo data in formatversion=2 too (1 comment) [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[06:55:43] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: inject x-client-ip from envoy [deployment-charts] - https://gerrit.wikimedia.org/r/747838 (https://phabricator.wikimedia.org/T297613) (owner: Giuseppe Lavagetto)
[06:59:02] (Merged) jenkins-bot: mediawiki: inject x-client-ip from envoy [deployment-charts] - https://gerrit.wikimedia.org/r/747838 (https://phabricator.wikimedia.org/T297613) (owner: Giuseppe Lavagetto)
[07:00:48] (PS1) Elukey: install_server: set the reuse recipe for all kafka-main hosts [puppet] - https://gerrit.wikimedia.org/r/747999 (https://phabricator.wikimedia.org/T296641)
[07:01:55] RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:12] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:49] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:01] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:11:10] (PS1) Giuseppe Lavagetto: tls_helpers: fix access to variables [deployment-charts] - https://gerrit.wikimedia.org/r/748001
[07:11:18] (CR) jerkins-bot: [V: -1] tls_helpers: fix access to variables [deployment-charts] - https://gerrit.wikimedia.org/r/748001 (owner: Giuseppe Lavagetto)
[07:12:21] (PS2) Giuseppe Lavagetto: tls_helpers: fix access to variables [deployment-charts] - https://gerrit.wikimedia.org/r/748001
[07:15:21] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:49] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:24:03] (CR) Giuseppe Lavagetto: [C: +2] tls_helpers: fix access to variables [deployment-charts] - https://gerrit.wikimedia.org/r/748001 (owner: Giuseppe Lavagetto)
[07:27:21] (Merged) jenkins-bot: tls_helpers: fix access to variables [deployment-charts] - https://gerrit.wikimedia.org/r/748001 (owner: Giuseppe Lavagetto)
[07:28:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:22] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:21] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:38:43] (PS1) Marostegui: jynus,kormat.bashrc: Replace mysql.py with db-mysql [puppet] - https://gerrit.wikimedia.org/r/748064 (https://phabricator.wikimedia.org/T297618)
[07:39:27] (CR) Marostegui: "Feel free to merge this patch or create your own at your own convenience to replace mysql.py" [puppet] - https://gerrit.wikimedia.org/r/748064 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[07:39:49] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[07:46:21] SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (akosiaris)
[07:46:30] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (akosiaris) p:Triage→Medium
[07:46:50] (PS1) Marostegui: events_eventlogging.sql: Remove file [software] - https://gerrit.wikimedia.org/r/748065
[07:47:26] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (akosiaris) Open→Stalled Thanks for creating this followup ticket. Stalling until early Janua...
[07:47:32] (CR) Marostegui: [C: +2] events_eventlogging.sql: Remove file [software] - https://gerrit.wikimedia.org/r/748065 (owner: Marostegui)
[07:48:03] (Merged) jenkins-bot: events_eventlogging.sql: Remove file [software] - https://gerrit.wikimedia.org/r/748065 (owner: Marostegui)
[07:49:50] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (Gehel) Approved!
[07:50:32] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (Gehel) Approved!
[07:52:25] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RhinosF1) Isn't the 'ops' LDAP group only supposed to be for people in the 'ops' shell group?
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211217T0800)
[08:01:29] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (Gehel) >>! In T297910#7577051, @RhinosF1 wrote: > Isn't the 'ops' LDAP group only supposed to be for people in the 'ops' shell group? @bking is an SRE in the Search Pla...
[08:05:25] PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:08] good morning
[08:09:24] Hey hashar
[08:09:27] Happy Friday!
[08:15:30] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (Gehel) After discussion with @RhinosF1 : * there are a number of other peoples who are in both ops and analytics-privatedata-users, so that's probably not an issue * we...
[08:17:04] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[08:18:02] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RhinosF1)
[08:18:40] (PS3) Filippo Giunchedi: prometheus: extend blackbox probes options [puppet] - https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946)
[08:18:42] (PS3) Filippo Giunchedi: hieradata: add zotero and helm-charts probes [puppet] - https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946)
[08:20:11] (CR) Filippo Giunchedi: "Thanks for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[08:20:31] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (elukey) >>! In T297910#7577062, @Gehel wrote: > After discussion with @RhinosF1 : > > * there are a number of other peoples who are in both ops and analytics-privatedat...
[08:20:47] (CR) Elukey: [C: +2] install_server: set the reuse recipe for all kafka-main hosts [puppet] - https://gerrit.wikimedia.org/r/747999 (https://phabricator.wikimedia.org/T296641) (owner: Elukey)
[08:25:26] (CR) JMeybohm: [C: +1] imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[08:26:12] (CR) Elukey: kserve-inference: allow the definition of tranformers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/747880 (owner: Elukey)
[08:26:24] elukey: so analytics-users is needed on top of ops (with kerbos)
[08:27:52] (CR) JMeybohm: [C: +1] Use the Kubernetes config API as it was in v7.0.0 (buster) [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[08:29:29] RhinosF1: analytics-privatedata-users, if needed, yes
[08:29:38] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (Peachey88)
[08:29:49] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:30:50] elukey: thanks!
[08:31:15] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RhinosF1)
[08:31:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:36] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RhinosF1) Thanks @elukey
[08:32:52] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RhinosF1) a: bking As task says they will upload patches
[08:34:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:01] (PS1) Giuseppe Lavagetto: _tls_helpers: properly quote header values [deployment-charts] - https://gerrit.wikimedia.org/r/748067
[08:40:49] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:53] (CR) Giuseppe Lavagetto: [C: +2] _tls_helpers: properly quote header values [deployment-charts] - https://gerrit.wikimedia.org/r/748067 (owner: Giuseppe Lavagetto)
[08:43:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:07] (Merged) jenkins-bot: _tls_helpers: properly quote header values [deployment-charts] - https://gerrit.wikimedia.org/r/748067 (owner: Giuseppe Lavagetto)
[08:46:16] (PS3) Elukey: kserve-inference: allow the definition of tranformers [deployment-charts] - https://gerrit.wikimedia.org/r/747880
[08:47:05] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:47:49] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:49:21] (CR) Jelto: [C: +2] helmfile.d/admin_ng: change ci deploy user [deployment-charts] - https://gerrit.wikimedia.org/r/747814 (https://phabricator.wikimedia.org/T297809) (owner: Jelto)
[08:49:45] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:02] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:38] (Merged) jenkins-bot: helmfile.d/admin_ng: change ci deploy user [deployment-charts] - https://gerrit.wikimedia.org/r/747814 (https://phabricator.wikimedia.org/T297809) (owner: Jelto)
[08:52:53] (Merged) jenkins-bot: helmfile.d/admin_ng: fix subjects of rolebinding in namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/747819 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[09:01:01] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:24] !log set sdq as offline, showing errors. megacli -PDOffline -PhysDrv '[32:14]' -aALL
[09:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:37] that will generate a task shortly
[09:03:38] godog: host?
[09:04:20] oops! you are right RhinosF1
[09:04:32] !log previous message refers to ms-be2065
[09:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:43] godog: np!
[09:07:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:59] good morning
[09:08:05] Hi
[09:09:06] jelto: I see you did some charts changes, Is deploy1002 you?
[09:09:17] I think I have found a bug in VisualEditor :D
[09:09:50] regardless I have been running the train this week and I am here today to deal with aftermaths if anything is needed
[09:09:51] What bug
[09:10:03] I will report the VE bug
[09:10:12] digging in Phabricator first to see if it got reported previously
[09:10:32] "There is no section X in revision yyy" when trying to edit the wikicode for a section ;)
[09:10:55] https://phabricator.wikimedia.org/T294642 hehe
[09:11:03] Yeah I've heard that before
[09:12:19] RhinosF1: I did quite usual changes, I did not expect some alert. So "no" not expected
[09:13:39] jelto: looks actually like it alerted before your change so can't be that
[09:16:29] PROBLEM - Check systemd state on ms-be2057 is CRITICAL: CRITICAL - degraded: The following units failed: session-133596.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:21:33] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[09:23:09] RECOVERY - Check systemd state on ms-be2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:55] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main2002.codfw.wmnet with OS buster
[09:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:15] (CR) Arturo Borrero Gonzalez: "The commit message suggests some controversy, but I don't get it. What's the problem with this patch? Do you have concerns that it will so" [puppet] - https://gerrit.wikimedia.org/r/747891 (owner: Jbond)
[09:36:08] ops-codfw: ms-be2065 failed drive sdq - https://phabricator.wikimedia.org/T297933 (fgiunchedi)
[09:37:44] !log restart blazegraph on wdqs1007 (jvm stuck for 6hours)
[09:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:52] (PS1) MVernon: admin: Add new user komla to restricted group [puppet] - https://gerrit.wikimedia.org/r/748074 (https://phabricator.wikimedia.org/T297621)
[09:38:51] (CR) jerkins-bot: [V: -1] admin: Add new user komla to restricted group [puppet] - https://gerrit.wikimedia.org/r/748074 (https://phabricator.wikimedia.org/T297621) (owner: MVernon)
[09:40:58] (PS2) MVernon: admin: Add new user komla to restricted group [puppet] - https://gerrit.wikimedia.org/r/748074 (https://phabricator.wikimedia.org/T297621)
[09:43:21] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:44:36] RhinosF1: ^ found the issue and fixed it :) automatic sync of git repo was blocked by local unstaged changes
[09:44:51] jelto: cool!
[09:49:37] (PS1) MVernon: admin: add mmunyoki to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/748075 (https://phabricator.wikimedia.org/T297842)
[09:50:00] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:02] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2002.codfw.wmnet with OS buster
[09:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (MatthewVernon)
[09:55:21] (CR) Kormat: [C: +1] admin: Add new user komla to restricted group [puppet] - https://gerrit.wikimedia.org/r/748074 (https://phabricator.wikimedia.org/T297621) (owner: MVernon)
[09:56:00] (CR) Kormat: [C: +1] admin: add mmunyoki to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/748075 (https://phabricator.wikimedia.org/T297842) (owner: MVernon)
[09:57:17] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:32] SRE, SRE-tools, Infrastructure-Foundations, homer, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (ayounsi) Stalled→In progress Finally merged!
[09:57:40] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:16] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The following units failed: session-134390.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:41] (Abandoned) Lucas Werkmeister (WMDE): logspam: Consolidate another kind of OOM messages [puppet] - https://gerrit.wikimedia.org/r/747102 (owner: Lucas Werkmeister (WMDE))
[10:01:19] (PS1) MVernon: admin: add rkemper to analytics-privatedata-users with krb [puppet] - https://gerrit.wikimedia.org/r/748077 (https://phabricator.wikimedia.org/T297908)
[10:03:59] (CR) Filippo Giunchedi: [C: -1] prometheus: add blackbox generic http/s static check support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: Herron)
[10:04:58] (CR) MVernon: [C: +2] admin: add mmunyoki to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/748075 (https://phabricator.wikimedia.org/T297842) (owner: MVernon)
[10:05:21] (PS1) Ayounsi: Move sandbox filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865)
[10:06:14] (CR) DCausse: [C: +1] sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson)
[10:06:25] (CR) Ayounsi: "Example diff:" [homer/public] - https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[10:07:35] (CR) Kormat: [C: +1] admin: add rkemper to analytics-privatedata-users with krb [puppet] - https://gerrit.wikimedia.org/r/748077 (https://phabricator.wikimedia.org/T297908) (owner: MVernon)
[10:10:50] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The following units failed: session-251129.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:39] (PS1) MVernon: add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842)
[10:17:29] (CR) jerkins-bot: [V: -1] add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842) (owner: MVernon)
[10:17:41] Amir1: if operations/software is being added is the explicit conftool above still required?
[10:19:43] SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (jcrespo) Codfw commonswiki backups are at 75% completion (68854627 files/301887395014767 bytes backed up), and will likely fini...
[10:26:04] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1010.eqiad.wmnet
[10:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:21] (PS4) Elukey: kserve-inference: allow the definition of tranformers [deployment-charts] - https://gerrit.wikimedia.org/r/747880
[10:28:50] (CR) jerkins-bot: [V: -1] kserve-inference: allow the definition of tranformers [deployment-charts] - https://gerrit.wikimedia.org/r/747880 (owner: Elukey)
[10:29:32] whatt
[10:32:16] "Error: 1 chart(s) linted, 1 chart(s) failed"
[10:32:19] that's not very helpful
[10:32:47] nono it is a pebcak, completely right
[10:32:52] I am fixing it :)
[10:33:03] (used helm3 lint to find the issue)
[10:34:25] (PS5) Elukey: kserve-inference: allow the definition of tranformers [deployment-charts] - https://gerrit.wikimedia.org/r/747880
[10:37:59] RhinosF1: no, that's a different repo.
[10:38:33] Amir1: oh, I thought it was like mediawiki/extensions where it picked up the sub repos
[10:38:56] no, picking up subrepos is actually a complicated thing
[10:39:29] yeah they're so slippery, i always drop some
[10:39:31] Ah :)
[10:42:39] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:44:13] (PS1) Majavah: hieradata: Empty kubernetes_cluster_groups on wmcs [puppet] - https://gerrit.wikimedia.org/r/748092 (https://phabricator.wikimedia.org/T297853)
[10:53:49] (PS2) MVernon: add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842)
[10:54:44] (CR) jerkins-bot: [V: -1] add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842) (owner: MVernon)
[10:56:32] (PS4) Urbanecm: [DNM] snapshot: Dump information about Growth mentorship [puppet] - https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966)
[10:57:52] (PS3) MVernon: add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842)
[10:58:50] (CR) jerkins-bot: [V: -1] add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842) (owner: MVernon)
[11:01:32] (PS4) MVernon: add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842)
[11:01:46] SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou)
[11:02:41] RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:29] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:35] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:04] (PS5) Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: relocate hosts updater function [puppet] - https://gerrit.wikimedia.org/r/747849
[11:05:06] (PS5) Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: cache openstack query [puppet] - https://gerrit.wikimedia.org/r/747850
[11:05:08] (PS6) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - https://gerrit.wikimedia.org/r/747851
[11:06:34] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - https://gerrit.wikimedia.org/r/747851 (owner: Arturo Borrero Gonzalez)
[11:07:15] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842) (owner: MVernon)
[11:07:54] (CR) MVernon: [C: +2] add mmunyoki to analytics-privatedata-users group [puppet] - https://gerrit.wikimedia.org/r/748082 (https://phabricator.wikimedia.org/T297842) (owner: MVernon)
[11:08:09] (PS7) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce some code to detect/cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851
[11:08:14] (PS1) Inductiveload: Move horizontal/vertical layout to CSS only [extensions/ProofreadPage] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/748095 (https://phabricator.wikimedia.org/T297339)
[11:09:00] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: introduce some code to detect/cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851 (owner: Arturo Borrero Gonzalez)
[11:10:14] (CR) MVernon: [C: +2] admin: Add new user komla to restricted group [puppet] - https://gerrit.wikimedia.org/r/748074 (https://phabricator.wikimedia.org/T297621) (owner: MVernon)
[11:10:27] (PS3) MVernon: admin: Add new user komla to restricted group [puppet] - https://gerrit.wikimedia.org/r/748074 (https://phabricator.wikimedia.org/T297621)
[11:10:34] (PS1) Joal: Add network_internal_flows to refine and druid-load [puppet] - https://gerrit.wikimedia.org/r/748097 (https://phabricator.wikimedia.org/T263277)
[11:10:54] SRE, SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (MatthewVernon) In progress→Resolved a: MatthewVernon This is now done (I suggest waiting an hour or so for changes to fully propagate).
[11:12:15] (CR) Joal: "@Arzhel: can you please check that the columns in the druid job look ok to you?" [puppet] - https://gerrit.wikimedia.org/r/748097 (https://phabricator.wikimedia.org/T263277) (owner: Joal)
[11:13:58] (PS5) Urbanecm: [DNM] snapshot: Dump information about Growth mentorship [puppet] - https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966)
[11:14:22] (PS6) Urbanecm: snapshot: Dump information about Growth mentorship [puppet] - https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966)
[11:14:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (MatthewVernon) In progress→Resolved a: MatthewVernon This is now complete (I'd wait an hour for the change to fully propagate, though).
[11:16:23] (CR) MVernon: [C: +2] admin: add rkemper to analytics-privatedata-users with krb [puppet] - https://gerrit.wikimedia.org/r/748077 (https://phabricator.wikimedia.org/T297908) (owner: MVernon)
[11:18:08] (PS2) Ayounsi: Move sandbox filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865)
[11:18:10] !log updating reprepro with new druid packages for buster-wikimedia to pick up new log4j jar files
[11:18:10] (PS1) Ayounsi: Move core routers loopback filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865)
[11:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:27] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: Urbanecm)
[11:20:38] (PS2) MVernon: admin: add rkemper to analytics-privatedata-users with krb [puppet] - https://gerrit.wikimedia.org/r/748077 (https://phabricator.wikimedia.org/T297908)
[11:23:10] SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) Adding question here in addition to the CR: For druid ingestion we have 2 jobs, the first ingests all colu...
[11:24:10] (PS1) Majavah: mediawiki: Move mwmaint system::roles to the role [puppet] - https://gerrit.wikimedia.org/r/748101
[11:24:13] (PS6) Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: relocate hosts updater function [puppet] - https://gerrit.wikimedia.org/r/747849
[11:24:15] (PS6) Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: cache openstack query [puppet] - https://gerrit.wikimedia.org/r/747850
[11:24:17] (PS8) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce some code to detect/cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851
[11:24:47] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (MatthewVernon) >>! In T297910#7577009, @RKemper wrote: > Brian and I will pair tomorrow on making the various puppet patches for the access request. > > (Also we should...
[11:24:49] (CR) Ayounsi: "Example diff:" [homer/public] - https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[11:25:10] (CR) Kormat: [C: +1] admin: add rkemper to analytics-privatedata-users with krb [puppet] - https://gerrit.wikimedia.org/r/748077 (https://phabricator.wikimedia.org/T297908) (owner: MVernon)
[11:25:28] (CR) MVernon: [C: +2] admin: add rkemper to analytics-privatedata-users with krb [puppet] - https://gerrit.wikimedia.org/r/748077 (https://phabricator.wikimedia.org/T297908) (owner: MVernon)
[11:25:40] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: introduce some code to detect/cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851 (owner: Arturo Borrero Gonzalez)
[11:31:01] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (MatthewVernon) In progress→Resolved a: MatthewVernon Done. You should have had an email about your new kerberos principal.
[11:32:55] (PS7) Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: relocate hosts updater function [puppet] - https://gerrit.wikimedia.org/r/747849
[11:32:57] (PS7) Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: cache openstack query [puppet] - https://gerrit.wikimedia.org/r/747850
[11:32:59] (PS9) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce some code to detect/cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851
[11:34:11] (PS7) Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - https://gerrit.wikimedia.org/r/747891
[11:34:31] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: introduce some code to detect/cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851 (owner: Arturo Borrero Gonzalez)
[11:34:44] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (MatthewVernon)
[11:35:05] (CR) Jbond: P:environment: Add a simple zshrc file to the home dir (2 comments) [puppet] - https://gerrit.wikimedia.org/r/747891 (owner: Jbond)
[11:38:37] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (MatthewVernon) I think this needs manager approval from @mwilliams and group-membership approval from @Ottomata or @odimitrijevic
[11:39:44] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (MatthewVernon) p: Triage→Medium
[11:42:05] !log Upgrading druid packages on an-druid1001.
[11:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:55] (PS10) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce more code to cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851
[11:45:27] (CR) Arturo Borrero Gonzalez: [C: +1] P:environment: Add a simple zshrc file to the home dir [puppet] - https://gerrit.wikimedia.org/r/747891 (owner: Jbond)
[11:51:19] (CR) Arturo Borrero Gonzalez: [C: +2] sonofgridengine: grid_configurator: relocate hosts updater function [puppet] - https://gerrit.wikimedia.org/r/747849 (owner: Arturo Borrero Gonzalez)
[11:51:27] (CR) Arturo Borrero Gonzalez: [C: +2] sonofgridengine: grid_configurator: cache openstack query [puppet] - https://gerrit.wikimedia.org/r/747850 (owner: Arturo Borrero Gonzalez)
[11:51:36] (CR) Arturo Borrero Gonzalez: [C: +2] sonofgridengine: grid-configurator: introduce more code to cleanup dead config [puppet] - https://gerrit.wikimedia.org/r/747851 (owner: Arturo Borrero Gonzalez)
[11:55:43] (PS3) Jbond: P:puppet_compiler: update workers to use shared puppetdb [puppet] - https://gerrit.wikimedia.org/r/747899
[11:55:51] (PS22) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[11:56:02] (PS16) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[12:00:53] (CR) Jbond: [C: +2] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[12:02:33] (PS7) Urbanecm: snapshot: Dump information about Growth mentorship [puppet] - https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966)
[12:03:34] (CR) Jbond: [C: +2] puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[12:07:04] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1004.eqiad.wmnet,service=druid-public-broker
[12:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:50] (PS1) Jbond: puppet_compiler: Add timer [puppet] - https://gerrit.wikimedia.org/r/748109
[12:16:04] (CR) Jbond: [C: +2] puppet_compiler: Add timer [puppet] - https://gerrit.wikimedia.org/r/748109 (owner: Jbond)
[12:16:40] (CR) jerkins-bot: [V: -1] puppet_compiler: Add timer [puppet] - https://gerrit.wikimedia.org/r/748109 (owner: Jbond)
[12:19:53] (PS2) Jbond: puppet_compiler: Add timer [puppet] - https://gerrit.wikimedia.org/r/748109
[12:20:46] (CR) Jbond: [C: +2] puppet_compiler: Add timer [puppet] - https://gerrit.wikimedia.org/r/748109 (owner: Jbond)
[12:23:24] (PS1) Jbond: pcc_uploader: fix typo [puppet] - https://gerrit.wikimedia.org/r/748110
[12:23:43] (CR) Jbond: [V: +2 C: +2] pcc_uploader: fix typo [puppet] - https://gerrit.wikimedia.org/r/748110 (owner: Jbond)
[12:25:06] (PS2) Ayounsi: Move core routers loopback filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865)
[12:25:08] (PS1) Ayounsi: Move core routers border-in filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865)
[12:26:49] (PS1) Jbond: pupppet_compiler: run timer as jenkins-deploy [puppet] - https://gerrit.wikimedia.org/r/748112
[12:27:11] (CR) Jbond: [C: +2] pupppet_compiler: run timer as jenkins-deploy [puppet] - https://gerrit.wikimedia.org/r/748112 (owner: Jbond)
[12:28:11] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: Move mwmaint system::roles to the role [puppet] - https://gerrit.wikimedia.org/r/748101 (owner: Majavah)
[12:29:34] (CR) Ayounsi: "Example diff:" [homer/public] - https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[12:31:29] !log Upgraded druid packages on druid1004
[12:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:44] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=druid1004.eqiad.wmnet,service=druid-public-broker
[12:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:51] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: / 2081 MB (3% inode=97%): /tmp 2081 MB (3% inode=97%): /var/tmp 2081 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[12:32:53] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host bast6001.wikimedia.org
[12:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:57] (PS2) Ayounsi: Move core routers border-in filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865)
[12:37:58] !log upgrading druid packages on an-druid1002
[12:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:06] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host bast6001.wikimedia.org
[12:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:15] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1005.eqiad.wmnet,service=druid-public-broker
[12:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:40] (PS1) Jelto: gitlab_runner: disable disk check for docker volumes [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481)
[12:43:43] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host bast6001.wikimedia.org
[12:43:46] !log Upgraded druid packages on an-druid1002.
[12:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:58] !log Upgraded druid packages on druid1005.
[12:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:09] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=druid1005.eqiad.wmnet,service=druid-public-broker
[12:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:28] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33047/console" [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[12:47:03] (PS2) Jelto: gitlab_runner: disable disk check for docker volumes [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481)
[12:47:41] !log Upgrading druid packages on an-druid1003
[12:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:34] (PS1) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: handle more dead config directories [puppet] - https://gerrit.wikimedia.org/r/748116
[12:48:35] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1006.eqiad.wmnet,service=druid-public-broker
[12:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:15] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: handle more dead config directories [puppet] - https://gerrit.wikimedia.org/r/748116 (owner: Arturo Borrero Gonzalez)
[12:50:04] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33048/console" [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[12:50:26] !log upgrading druid packages on druid1006
[12:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:51] (PS3) Jelto: gitlab_runner: disable disk check for docker volumes [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481)
[12:51:12] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=druid1006.eqiad.wmnet,service=druid-public-broker
[12:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:31] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (akosiaris) >>! In T297307#7576437, @Dzahn wrote: >>>! In T297307#7574634, @akosiaris wrote: >> I am inclined to resolve this task, but I...
[12:52:45] !log upgrading druid packages on an-druid1004
[12:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:23] PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: / 1838 MB (3% inode=97%): /tmp 1838 MB (3% inode=97%): /var/tmp 1838 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[12:53:41] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33049/console" [puppet] - https://gerrit.wikimedia.org/r/748114 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[12:53:50] !log milimetric@deploy1002 Started deploy [analytics/refinery@e9f04c3]: Fix sanitize allowlist problem
[12:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:26] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1007.eqiad.wmnet,service=druid-public-broker
[12:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:52] !log upgrading druid packages on druid1007
[12:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:38] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=druid1007.eqiad.wmnet,service=druid-public-broker
[12:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:32] !log upgraded druid packages on an-druid1005
[13:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:29] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1008.eqiad.wmnet,service=druid-public-broker
[13:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:52] !log upgraded druid packages on druid1008
[13:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:10] (PS1) Jbond: puppet_compiler::uploader: move process facts file to webroot [puppet] - https://gerrit.wikimedia.org/r/748124
[13:04:48] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=druid1008.eqiad.wmnet,service=druid-public-broker
[13:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:26] !log mmandere@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast6001.wikimedia.org
[13:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:55] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[13:08:37] SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (BTullis) I believe that it is not necessary to refine this data.
[13:11:10] (PS2) Jbond: puppet_compiler::uploader: move process facts file to webroot [puppet] - https://gerrit.wikimedia.org/r/748124
[13:12:39] (CR) Jbond: [C: +2] puppet_compiler::uploader: move process facts file to webroot [puppet] - https://gerrit.wikimedia.org/r/748124 (owner: Jbond)
[13:15:06] (PS1) BBlack: bast6001: set dhcp macaddr for ganeti vm [puppet] - https://gerrit.wikimedia.org/r/748125 (https://phabricator.wikimedia.org/T282787)
[13:15:56] (CR) BBlack: [C: +2] bast6001: set dhcp macaddr for ganeti vm [puppet] - https://gerrit.wikimedia.org/r/748125 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[13:16:24] (PS1) Jbond: puppet_compiler: use correct jenkis group [puppet] - https://gerrit.wikimedia.org/r/748126
[13:16:47] (CR) Jbond: [C: +2] puppet_compiler: use correct jenkis group [puppet] - https://gerrit.wikimedia.org/r/748126 (owner: Jbond)
[13:17:26] bblack: happy for me tpo merge yours>
[13:17:30] bast6001: set dhcp macaddr for ganeti vm (43737d5620)
[13:17:57] jbond: please do
[13:18:15] done
[13:18:27] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[13:18:35] thanks!
[13:18:45] np :)
[13:20:09] !log milimetric@deploy1002 Finished deploy [analytics/refinery@e9f04c3]: Fix sanitize allowlist problem (duration: 26m 19s)
[13:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:24] !log milimetric@deploy1002 Started deploy [analytics/refinery@e9f04c3] (thin): Fix sanitize allowlist problem [THIN]
[13:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:31] !log milimetric@deploy1002 Finished deploy [analytics/refinery@e9f04c3] (thin): Fix sanitize allowlist problem [THIN] (duration: 00m 07s)
[13:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:37] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[13:22:25] (CR) Jbond: "this looks good to me but can we also confirm (either here or on task) that we have updated or confirmed the configuration upstream https:" [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: JHathaway)
[13:25:01] SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) In theory there should not be any PII data, but it would be safer to sanitize is nonetheless. As the data is m...
[13:25:29] !log milimetric@deploy1002 Started deploy [analytics/refinery@e9f04c3] (hadoop-test): Fix sanitize allowlist problem [TEST]
[13:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:40] (PS1) Jbond: puppet_compiler: fix call to unlink [puppet] - https://gerrit.wikimedia.org/r/748127
[13:29:15] (CR) Jbond: [C: +2] puppet_compiler: fix call to unlink [puppet] - https://gerrit.wikimedia.org/r/748127 (owner: Jbond)
[13:32:54] (PS1) Jbond: puppet_compiler: fix group [puppet] - https://gerrit.wikimedia.org/r/748128
[13:33:38] (CR) Jbond: [C: +2] puppet_compiler: fix group [puppet] - https://gerrit.wikimedia.org/r/748128 (owner: Jbond)
[13:41:38] (PS1) Jbond: puppet_compiler::upload: tar_file is actully a Path [puppet] - https://gerrit.wikimedia.org/r/748129
[13:42:19] (CR) jerkins-bot: [V: -1] puppet_compiler::upload: tar_file is actully a Path [puppet] - https://gerrit.wikimedia.org/r/748129 (owner: Jbond)
[13:43:21] (PS2) Jbond: puppet_compiler::upload: tar_file is actully a Path [puppet] - https://gerrit.wikimedia.org/r/748129
[13:44:06] (CR) Jbond: [C: +2] puppet_compiler::upload: tar_file is actully a Path [puppet] - https://gerrit.wikimedia.org/r/748129 (owner: Jbond)
[13:44:28] SRE: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (Fokebox)
[13:46:36] SRE, observability, service-runner, serviceops-radar: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (akosiaris) >>! In T222795#7571702, @fgiunchedi wrote: > I believe this is now (partially?) done, and service-runner s...
[13:59:39] (PS2) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: handle more dead config directories [puppet] - https://gerrit.wikimedia.org/r/748116
[13:59:41] (PS1) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: delete config from datastore [puppet] - https://gerrit.wikimedia.org/r/748132
[14:00:17] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: handle more dead config directories [puppet] - https://gerrit.wikimedia.org/r/748116 (owner: Arturo Borrero Gonzalez)
[14:00:37] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: delete config from datastore [puppet] - https://gerrit.wikimedia.org/r/748132 (owner: Arturo Borrero Gonzalez)
[14:07:11] (PS2) Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132
[14:07:56] (CR) jerkins-bot: [V: -1] sonofgridengine: grid-configurator: delete more dead config [puppet] - https://gerrit.wikimedia.org/r/748132 (owner: Arturo Borrero Gonzalez)
[14:21:45] (CR) Michael Große: wdqs: switch GUI deployment from latest to present (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn)
[14:22:02] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main2001.codfw.wmnet with OS buster
[14:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:11] !log milimetric@deploy1002 Finished deploy [analytics/refinery@e9f04c3] (hadoop-test): Fix sanitize allowlist problem [TEST] (duration: 69m 41s)
[14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:42] (PS1) Bking: admin: add bking to shell users Bug: T297910 [puppet] - https://gerrit.wikimedia.org/r/748135 (https://phabricator.wikimedia.org/T297910)
[14:43:35] (CR) jerkins-bot: [V: -1] admin: add bking to shell users Bug: T297910 [puppet] - https://gerrit.wikimedia.org/r/748135 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[14:45:01] (PS2) Bking: admin: add bking to shell users [puppet] - https://gerrit.wikimedia.org/r/748135 (https://phabricator.wikimedia.org/T297910)
[14:50:05] inflatador: you haven't actually put yourself in any groups
[14:50:42] (PS1) Bking: admin: Add Brian King to ops and analytics_privatedata_users groups [puppet] - https://gerrit.wikimedia.org/r/748137 (https://phabricator.wikimedia.org/T297910)
[14:51:43] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/748137 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[14:51:50] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/748135 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[14:52:14] RhinosF1 just added another commit for that
[14:52:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2001.codfw.wmnet with OS buster
[14:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:59] inflatador: I saw, it's pushed them as 2 separate changes so whoever merges will have to check the order
[14:55:12] (CR) RhinosF1: [C: +1] admin: Add Brian King to ops and analytics_privatedata_users groups [puppet] - https://gerrit.wikimedia.org/r/748137 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[14:55:22] (CR) RhinosF1: [C: +1] admin: add bking to shell users [puppet] - https://gerrit.wikimedia.org/r/748135 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[14:56:45] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (mwilliams) Approved!
[15:03:28] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (MatthewVernon)
[15:15:57] (PS1) Andrew Bogott: cinder backups: dump backup config to yaml [puppet] - https://gerrit.wikimedia.org/r/748140 (https://phabricator.wikimedia.org/T294429)
[15:16:48] !log milimetric@deploy1002 Started deploy [analytics/refinery@5c3bce1]: Fix refine sanitize allowlist, remove mediawiki_skin_diff schema for now
[15:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:14] (PS2) Andrew Bogott: cinder backups: dump backup config to yaml [puppet] - https://gerrit.wikimedia.org/r/748140 (https://phabricator.wikimedia.org/T294429)
[15:20:12] (CR) Andrew Bogott: [C: +2] cinder backups: dump backup config to yaml [puppet] - https://gerrit.wikimedia.org/r/748140 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[15:26:17] (PS1) JMeybohm: Upgrade simple-cfssl to forked version wmf-dev [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/748141 (https://phabricator.wikimedia.org/T294560)
[15:26:19] (PS1) JMeybohm: Use vendored dependencies for docker builds from source tree [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/748142 (https://phabricator.wikimedia.org/T294560)
[15:26:21] (PS1) JMeybohm: Add support for returning bundles instead of certs from sign calls [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560)
[15:27:49] (CR) JMeybohm: [V: +2 C: +2] Upgrade simple-cfssl to forked version wmf-dev [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/748141 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[15:28:12] (CR) JMeybohm: [V: +2 C: +2] Use vendored dependencies for docker builds from source tree [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/748142 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[15:30:22] (CR) Elukey: [C: +2] kserve-inference: allow the definition of tranformers [deployment-charts] - https://gerrit.wikimedia.org/r/747880 (owner: Elukey)
[15:33:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[15:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[15:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[15:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:11] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[15:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[15:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[15:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:46] (PS2) JMeybohm: Add support for returning bundles instead of certs from sign calls [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560)
[15:41:09] PROBLEM - Check systemd state on thanos-be2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:27] (PS1) Accraze: ml-services: add articlequality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/748147 (https://phabricator.wikimedia.org/T294141)
[15:52:33] PROBLEM - Check systemd state on thanos-be2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:55] (PS2) Accraze: ml-services: add articlequality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/748147 (https://phabricator.wikimedia.org/T294141)
[16:01:58] SRE-swift-storage, Observability-Metrics, serviceops: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959 (fgiunchedi)
[16:02:01] RECOVERY - Check systemd state on thanos-be2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:29] RECOVERY - configured eth on ganeti6004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[16:02:31] RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[16:02:32] !log root@thanos-be2004:/srv/log/swift# rm server.log.1 - T297959
[16:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:39] T297959: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959
[16:03:41] SRE, ops-codfw: ms-be2065 failed drive sdq - https://phabricator.wikimedia.org/T297933 (Papaul) p:Triage→Medium
[16:04:34] (PS1) Andrew Bogott: wmcs-cinder-volume-backup: get openstack creds from novaadmin.yaml [puppet] - https://gerrit.wikimedia.org/r/748149 (https://phabricator.wikimedia.org/T294429)
[16:05:00] (CR) CDanis: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/747891 (owner: Jbond)
[16:05:51] (CR) Andrew Bogott: [C: +2] wmcs-cinder-volume-backup: get openstack creds from novaadmin.yaml [puppet] - https://gerrit.wikimedia.org/r/748149 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[16:07:13] (PS1) Filippo Giunchedi: swift: temp ban for tegola access logs [puppet] - https://gerrit.wikimedia.org/r/748150 (https://phabricator.wikimedia.org/T297959)
[16:08:17] (CR) Filippo Giunchedi: "Sample log" [puppet] - https://gerrit.wikimedia.org/r/748150 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi)
[16:10:10] (PS2) Filippo Giunchedi: swift: temp ban for tegola access logs [puppet] - https://gerrit.wikimedia.org/r/748150 (https://phabricator.wikimedia.org/T297959)
[16:13:59] RECOVERY - configured eth on ganeti6001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[16:14:19] RECOVERY - configured eth on ganeti6003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[16:14:41] RECOVERY - configured eth on ganeti6002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[16:15:35] (PS1) BBlack: bast6001: add to bastion_hosts [puppet] - https://gerrit.wikimedia.org/r/748151 (https://phabricator.wikimedia.org/T282787)
[16:17:05] RECOVERY - Check systemd state on thanos-be2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:45] (CR) BBlack: [C: +2] bast6001: add to bastion_hosts [puppet] - https://gerrit.wikimedia.org/r/748151 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[16:22:30] RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[16:26:35] (CR) MVernon: [C: +1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/748150 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi)
[16:26:37] !log milimetric@deploy1002 Finished deploy [analytics/refinery@5c3bce1]: Fix refine sanitize allowlist, remove mediawiki_skin_diff schema for now (duration: 69m 48s)
[16:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:12] (PS1) Elukey: kserve-inference: add network policies for transformers [deployment-charts] - https://gerrit.wikimedia.org/r/748153
[16:28:52] !log reboot bast6001 (downtimed)
[16:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:10] !log milimetric@deploy1002 Started deploy [analytics/refinery@5c3bce1] (thin): Fix refine sanitize allowlist, remove mediawiki_skin_diff schema for now [THIN]
[16:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:17] !log milimetric@deploy1002 Finished deploy [analytics/refinery@5c3bce1] (thin): Fix refine sanitize allowlist, remove mediawiki_skin_diff schema for now [THIN] (duration: 00m 07s)
[16:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:59] (CR) Filippo Giunchedi: [C: +2] swift: temp ban for tegola access logs [puppet] - https://gerrit.wikimedia.org/r/748150 (https://phabricator.wikimedia.org/T297959) (owner: Filippo Giunchedi)
[16:32:49] (CR) Elukey: [C: +2] kserve-inference: add network policies for transformers [deployment-charts] - https://gerrit.wikimedia.org/r/748153 (owner: Elukey)
[16:33:21] (CR) Accraze: [C: +1] kserve-inference: add network policies for transformers [deployment-charts] - https://gerrit.wikimedia.org/r/748153 (owner: Elukey)
[16:33:39] !log remove /var/log/swift/server.log.1 from thanos-be* - T297959
[16:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:44] T297959: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959
[16:37:17] (PS1) Andrew Bogott: wmcs-cinder-volume-backup: support user-requested full backups [puppet] - https://gerrit.wikimedia.org/r/748158 (https://phabricator.wikimedia.org/T294429)
[16:39:10] (CR) Andrew Bogott: [C: +2] wmcs-cinder-volume-backup: support user-requested full backups [puppet] - https://gerrit.wikimedia.org/r/748158 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[16:39:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[16:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:17] SRE-swift-storage, Observability-Metrics, serviceops, Patch-For-Review: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959 (fgiunchedi) I've bandaided the immediate issue, leaving the task open since we haven't addressed the high volume of logs
[16:44:42] !log ganeti6003 - rebooting
[16:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:29] (PS1) Elukey: kserve-inference: fix network policy template [deployment-charts] - https://gerrit.wikimedia.org/r/748159
[16:53:59] !log bast6001: shutdown->start
[16:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:28] (CR) Accraze: [C: +1] kserve-inference: fix network policy template [deployment-charts] - https://gerrit.wikimedia.org/r/748159 (owner: Elukey)
[16:54:42] (CR) Elukey: [C: +2] kserve-inference: fix network policy template [deployment-charts] - https://gerrit.wikimedia.org/r/748159 (owner: Elukey)
[16:56:20] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:56:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[16:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:08] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:59:05] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[16:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:49] !log milimetric@deploy1002 Started deploy [analytics/refinery@0778d1e]: Proper fix for mediawiki_skin_diff
[16:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:56] (PS6) Cwhite: role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239)
[17:01:27] (CR) Cwhite: role: add apifeatureusage role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[17:01:43] (CR) SBassett: wdqs: switch GUI deployment from latest to present (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn)
[17:05:58] (CR) Elukey: [C: +2] ml-services: add articlequality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/748147 (https://phabricator.wikimedia.org/T294141) (owner: Accraze)
[17:07:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[17:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:14] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[17:17:56] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[17:19:28] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[17:20:08] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[17:20:34] !log milimetric@deploy1002 Finished deploy [analytics/refinery@0778d1e]: Proper fix for mediawiki_skin_diff (duration: 20m 45s)
[17:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:26] !log bast6001: shutdown->start (again)
[17:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:18] !log milimetric@deploy1002 Started deploy [analytics/refinery@0778d1e] (thin): Proper fix for mediawiki_skin_diff [THIN]
[17:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:24] !log milimetric@deploy1002 Finished deploy [analytics/refinery@0778d1e] (thin): Proper fix for mediawiki_skin_diff [THIN] (duration: 00m 06s)
[17:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:23] (CR) Lucas Werkmeister (WMDE): wdqs: switch GUI deployment from latest to present (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn)
[17:32:31] The yearly calendar (https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar) says there'll be no deploys at all until January 4th, but the regular deployments page has backport windows still listed for next week -- which is right?
[17:32:58] SRE: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (Legoktm) @Fokebox can you clarify which [[https://meta.wikimedia.org/wiki/Wikimedia_movement_affiliates|Wikimedia Affiliate]] is backing/supporting this project?
[17:33:21] (PS1) Majavah: dynamicproxy: enforce project permissions [puppet] - https://gerrit.wikimedia.org/r/748171 (https://phabricator.wikimedia.org/T295234)
[17:33:58] (CR) jerkins-bot: [V: -1] dynamicproxy: enforce project permissions [puppet] - https://gerrit.wikimedia.org/r/748171 (https://phabricator.wikimedia.org/T295234) (owner: Majavah)
[17:35:25] (PS2) Majavah: dynamicproxy: enforce project permissions [puppet] - https://gerrit.wikimedia.org/r/748171 (https://phabricator.wikimedia.org/T295234)
[17:36:01] (CR) jerkins-bot: [V: -1] dynamicproxy: enforce project permissions [puppet] - https://gerrit.wikimedia.org/r/748171 (https://phabricator.wikimedia.org/T295234) (owner: Majavah)
[17:37:14] (PS3) Majavah: dynamicproxy: enforce project permissions [puppet] - https://gerrit.wikimedia.org/r/748171 (https://phabricator.wikimedia.org/T295234)
[17:39:57] (PS1) Elukey: helmfile.d: fix the ml-serve's articlequality transformer config [deployment-charts] - https://gerrit.wikimedia.org/r/748172
[17:41:14] (CR) Accraze: [C: +1] helmfile.d: fix the ml-serve's articlequality transformer config [deployment-charts] - https://gerrit.wikimedia.org/r/748172 (owner: Elukey)
[17:43:41] !log bblack@cumin1001 START - Cookbook sre.ganeti.makevm for new host install6001.wikimedia.org
[17:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:03] (PS1) BBlack: install6001: add site.pp entry [puppet] - https://gerrit.wikimedia.org/r/748174 (https://phabricator.wikimedia.org/T282787)
[17:49:16] (PS1) BBlack: install6001: use for drmrs installs [puppet] - https://gerrit.wikimedia.org/r/748175 (https://phabricator.wikimedia.org/T282787)
[17:51:37] (CR) Andrew Bogott: [C: +2] "Looks great -- thanks!" [puppet] - https://gerrit.wikimedia.org/r/748171 (https://phabricator.wikimedia.org/T295234) (owner: Majavah)
[17:51:39] (PS1) BBlack: install6001: use as proxy for drmrs [dns] - https://gerrit.wikimedia.org/r/748178 (https://phabricator.wikimedia.org/T282787)
[17:51:54] (CR) Elukey: [C: +2] helmfile.d: fix the ml-serve's articlequality transformer config [deployment-charts] - https://gerrit.wikimedia.org/r/748172 (owner: Elukey)
[17:57:03] (CR) SBassett: wdqs: switch GUI deployment from latest to present (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn)
[17:57:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[17:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:17] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (sdkim)
[17:58:57] !log bblack@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install6001.wikimedia.org
[17:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:05] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (sdkim) For whoever picks this up, please reach out and I can provide the expected traffic they have provided.
[17:59:15] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (sdkim)
[17:59:42] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (thcipriani) >>! In T297323#7565793, @MatthewVernon wrote: > @thcipriani are you OK to approve this request, please? [or suggest someone else in releng who might be appropriate to do so?] Approved!...
[18:00:10] (CR) BBlack: [C: +2] install6001: add site.pp entry [puppet] - https://gerrit.wikimedia.org/r/748174 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[18:04:01] (CR) Lucas Werkmeister (WMDE): wdqs: switch GUI deployment from latest to present (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn)
[18:07:08] (PS2) BBlack: install6001: use for drmrs installs [puppet] - https://gerrit.wikimedia.org/r/748175 (https://phabricator.wikimedia.org/T282787)
[18:07:09] (PS1) BBlack: install6001: dhcp entry [puppet] - https://gerrit.wikimedia.org/r/748182 (https://phabricator.wikimedia.org/T282787)
[18:08:26] (CR) BBlack: [C: +2] install6001: dhcp entry [puppet] - https://gerrit.wikimedia.org/r/748182 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[18:10:00] (Abandoned) Ebernhardson: dumps: Move cirrus dumps to friday [puppet] - https://gerrit.wikimedia.org/r/747879 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[18:15:41] (CR) BBlack: [C: +2] install6001: use for drmrs installs [puppet] - https://gerrit.wikimedia.org/r/748175 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[18:35:50] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:38:02] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:38:02] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:39:51] (CR) RLazarus: [C: +2] Use the Kubernetes config API as it was in v7.0.0 (buster) [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[18:41:54] (Merged) jenkins-bot: Use the Kubernetes config API as it was in v7.0.0 (buster) [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[18:42:28] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:49:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS buster
[18:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:56] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster
[18:50:48] (PS1) Accraze: articlequality: update transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/748185 (https://phabricator.wikimedia.org/T294141)
[18:51:09] (CR) jerkins-bot: [V: -1] articlequality: update transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/748185 (https://phabricator.wikimedia.org/T294141) (owner: Accraze)
[18:53:18] (CR) Herron: [C: -1] "The approach LGTM, but I'm not fond of naming this (profile|role)::apifeatureusage::collector. Since the software is logstash I think thi" [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[18:56:36] (CR) Herron: [C: +1] "LGTM overall but please see comment" [puppet] - https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[18:58:28] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:59:24] (PS1) Andrew Bogott: wmcs-cinder-volume-backup: Add --purge-older-than option [puppet] - https://gerrit.wikimedia.org/r/748206 (https://phabricator.wikimedia.org/T294429)
[19:00:56] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (Cmjohnson) @Volans These servers will not install correctly, I noticed that these have embedded 1G nic cards as the primary nic and I suspect the c...
[19:01:17] (PS1) RLazarus: Release v0.0.2 [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748207
[19:03:16] (CR) Ryan Kemper: [C: +2] admin: add bking to shell users [puppet] - https://gerrit.wikimedia.org/r/748135 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[19:04:05] (CR) RLazarus: [C: +2] Release v0.0.2 [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748207 (owner: RLazarus)
[19:04:45] SRE: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (Aklapper) Open→Stalled
[19:06:13] (CR) Ryan Kemper: [C: +2] admin: Add Brian King to ops and analytics_privatedata_users groups [puppet] - https://gerrit.wikimedia.org/r/748137 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[19:06:26] (PS2) Ryan Kemper: admin: Add Brian King to ops and analytics_privatedata_users groups [puppet] - https://gerrit.wikimedia.org/r/748137 (https://phabricator.wikimedia.org/T297910) (owner: Bking)
[19:06:46] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1025.eqiad.wmnet with OS buster
[19:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:51] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with...
[19:07:31] (Merged) jenkins-bot: Release v0.0.2 [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748207 (owner: RLazarus)
[19:15:11] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/python3-imagecatalog/imagecatalog_0.0.2-1_amd64.changes
[19:15:13] (PS2) Accraze: articlequality: update transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/748185 (https://phabricator.wikimedia.org/T294141)
[19:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:15] (CR) Elukey: [C: +2] articlequality: update transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/748185 (https://phabricator.wikimedia.org/T294141) (owner: Accraze)
[19:21:55] SRE, observability, service-runner, serviceops-radar: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (Ottomata) I had to go back and check, but both eventgate and evenstreams are using service-runner prometheus directly...
[19:22:50] SRE, observability, service-runner, serviceops-radar: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (Ottomata) Oh, is that not related? Anyway, I'm not aware of any alerts on GC stats, and at the very worst we'll have...
[19:23:45] (CR) BBlack: [C: +2] install6001: use as proxy for drmrs [dns] - https://gerrit.wikimedia.org/r/748178 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[19:25:39] (CR) Dzahn: [C: +2] mediawiki: Move mwmaint system::roles to the role [puppet] - https://gerrit.wikimedia.org/r/748101 (owner: Majavah)
[19:26:13] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[19:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:30] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn)
[19:33:38] SRE, Infrastructure-Foundations, Mail, fr-donorservices: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (Dzahn)
[19:33:44] SRE, Infrastructure-Foundations, Mail, fr-donorservices: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (Dzahn) boldy merging into T297915 based on T297307#7577554
[19:33:54] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) Ok, thanks! Done. I merged the tickets into one.
[19:39:46] !log puppetmaster1001 - sudo puppet cert clean webserver-misc-apps.discovery.wmnet - Revoked certificate with serial 8502
[19:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:16] !log T297910 `ryankemper@mwmaint1002:~$ sudo modify-ldap-group ops` to add `bking`
[19:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:45] !log T297910 `ryankemper@mwmaint1002:~$ sudo modify-ldap-group wmf` to add `bking`
[19:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:04] !log adding dbtree.wikimedia.org and tendril.wikimedia.org to TLS cert for webserver-misc-apps.discovery.wmnet - recreating cert T297605
[19:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:10] T297605: Shutdown Tendril and dbtree - https://phabricator.wikimedia.org/T297605
[19:46:23] (CR) Andrew Bogott: [C: +2] wmcs-cinder-volume-backup: Add --purge-older-than option [puppet] - https://gerrit.wikimedia.org/r/748206 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[19:52:05] (PS1) Dzahn: ssl: update certificate for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/748210 (https://phabricator.wikimedia.org/T297605)
[19:52:41] (CR) jerkins-bot: [V: -1] ssl: update certificate for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/748210 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[19:52:57] (PS2) Dzahn: ssl: update certificate for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/748210 (https://phabricator.wikimedia.org/T297605)
[19:53:03] (CR) Dzahn: [V: +1] "openssl x509 -in webserver-misc-apps.discovery.wmnet.crt -text -noout | grep DNS" [puppet] - https://gerrit.wikimedia.org/r/748210 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[19:53:57] (CR) Bking: [C: +2] cirrussearch: s/sanitizer/saneitizer [puppet] - https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) (owner: Ryan Kemper)
[19:54:07] (CR) Dzahn: [V: +1 C: +2] ssl: update certificate for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/748210 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[20:04:41] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RKemper) Added bking to LDAP users and merged the puppet patches. This should be all done AFAICT.
[20:07:32] (CR) Dzahn: "@Bking Welcome !:) I merged this on the puppetmaster (sudo puppet-merge) because both our changes were conflicting and this seemed harmles" [puppet] - https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) (owner: Ryan Kemper)
[20:14:09] (PS1) Andrew Bogott: wmcs-cinder-backup-manager: Support periodic full backups and purges [puppet] - https://gerrit.wikimedia.org/r/748211
[20:14:53] (CR) jerkins-bot: [V: -1] wmcs-cinder-backup-manager: Support periodic full backups and purges [puppet] - https://gerrit.wikimedia.org/r/748211 (owner: Andrew Bogott)
[20:16:39] (PS2) Andrew Bogott: wmcs-cinder-backup-manager: Support periodic full backups and purges [puppet] - https://gerrit.wikimedia.org/r/748211 (https://phabricator.wikimedia.org/T294429)
[20:18:02] (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager: Support periodic full backups and purges [puppet] - https://gerrit.wikimedia.org/r/748211 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[20:21:46] (PS2) Legoktm: Delete unused role::mediawiki::irc_events [puppet] - https://gerrit.wikimedia.org/r/747639 (https://phabricator.wikimedia.org/T272559)
[20:23:39] (CR) Legoktm: [C: +2] Delete unused role::mediawiki::irc_events [puppet] - https://gerrit.wikimedia.org/r/747639 (https://phabricator.wikimedia.org/T272559) (owner: Legoktm)
[20:24:31] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (Legoktm)
[20:29:10] (PS4) Legoktm: Set $wgMaxImageArea = false; [mediawiki-config] - https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014)
[20:31:39] (CR) Legoktm: [C: +2] Set $wgMaxImageArea = false; [mediawiki-config] - https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014) (owner: Legoktm)
[20:32:01] (PS3) Legoktm: Remove obsolete Timeline configuration and fonts submodule [mediawiki-config] - https://gerrit.wikimedia.org/r/723652
[20:32:21] (Merged) jenkins-bot: Set $wgMaxImageArea = false; [mediawiki-config] - https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014) (owner: Legoktm)
[20:33:22] (PS1) Dzahn: ssl: additionally add tendril-static to certificate for webserver-misc-apps [puppet] - https://gerrit.wikimedia.org/r/748212 (https://phabricator.wikimedia.org/T297605)
[20:33:43] (PS2) Dzahn: ssl: additionally add tendril-static to certificate for webserver-misc-apps [puppet] - https://gerrit.wikimedia.org/r/748212 (https://phabricator.wikimedia.org/T297605)
[20:34:13] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Set $wgMaxImageArea = false; (T291014) (duration: 00m 59s)
[20:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:18] T291014: Terminate all implicit use of VipsScaler code from Wikimedia production so we can remove it without breaking things this time - https://phabricator.wikimedia.org/T291014
[20:34:21] (CR) Dzahn: [C: +2] "openssl x509 -in webserver-misc-apps.discovery.wmnet.crt -text -noout | grep DNS" [puppet] - https://gerrit.wikimedia.org/r/748212 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[20:39:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:37] (PS3) Dzahn: wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - https://gerrit.wikimedia.org/r/745902
[20:41:12] (CR) jerkins-bot: [V: -1] wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - https://gerrit.wikimedia.org/r/745902 (owner: Dzahn)
[20:43:08] (PS4) Dzahn: wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - https://gerrit.wikimedia.org/r/745902
[20:43:44] (CR) jerkins-bot: [V: -1] wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - https://gerrit.wikimedia.org/r/745902 (owner: Dzahn)
[20:43:53] (CR) Dzahn: "@jem so that you can just type "update_all.sh" etc from now on" [puppet] - https://gerrit.wikimedia.org/r/745902 (owner: Dzahn)
[20:46:17] (PS5) Dzahn: wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - https://gerrit.wikimedia.org/r/745902
[20:47:02] (CR) Dzahn: [C: +2] wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - https://gerrit.wikimedia.org/r/745902 (owner: Dzahn)
[20:49:49] (CR) Herron: role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[20:49:57] (CR) Cwhite: role: add apifeatureusage role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[20:53:45] (PS7) Cwhite: role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239)
[20:56:20] !log puppetmaster - revoking and recreating TLS cert for miscweb one more time because "tendril-static" isn't "static-tendril" ;Pp
[20:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:17] (CR) Herron: [C: +1] role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[21:00:07] (PS1) Dzahn: miscweb: fix TLS cert because tendril-static != static-tendril [puppet] - https://gerrit.wikimedia.org/r/748213 (https://phabricator.wikimedia.org/T297605)
[21:00:45] (CR) Dzahn: [V: +1 C: +2] "openssl x509 -in webserver-misc-apps.discovery.wmnet.crt -text -noout | grep DNS" [puppet] - https://gerrit.wikimedia.org/r/748213 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[21:02:48] (PS1) Clare Ming: Fix wordmark svgs for strategywiki, viwikibooks. [mediawiki-config] - https://gerrit.wikimedia.org/r/748214 (https://phabricator.wikimedia.org/T290091)
[21:03:45] (CR) Legoktm: "This probably needs a full scap sync-world. Next week maybe." [mediawiki-config] - https://gerrit.wikimedia.org/r/723652 (owner: Legoktm)
[21:05:47] (CR) Dzahn: gitlab_runner: use config template for registering new runners (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[21:06:48] (CR) Dzahn: "code look solid, just more of a comment to clarify the priveleged true/false setting." [puppet] - https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[21:08:10] !log repooling wtp1025
[21:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:30] Kemayo: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1937618&oldid=1937485
[21:17:31] !log bblack@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus6001.drmrs.wmnet
[21:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:00] (PS2) Clare Ming: Fix wordmark svgs for strategywiki, viwikibooks. [mediawiki-config] - https://gerrit.wikimedia.org/r/748214 (https://phabricator.wikimedia.org/T290091)
[21:21:21] legoktm: thanks!
[21:21:40] !log bblack@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus6001.drmrs.wmnet
[21:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:41] (PS1) BBlack: drmrs: remove fake prometheus6001 dns entry [dns] - https://gerrit.wikimedia.org/r/748215 (https://phabricator.wikimedia.org/T282787)
[21:27:30] (CR) BBlack: [C: +2] drmrs: remove fake prometheus6001 dns entry [dns] - https://gerrit.wikimedia.org/r/748215 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[21:28:44] !log bblack@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus6001.drmrs.wmnet
[21:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:29] (PS3) Dzahn: service/miscweb: switch state from lvs_setup to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/694629 (https://phabricator.wikimedia.org/T281538)
[21:35:13] (CR) Dzahn: [V: +1 C: +2] "affecting only Icinga checks: https://puppet-compiler.wmflabs.org/pcc-worker1002/33053/ - https://puppet-compiler.wmflabs.org/pcc-worker10" [puppet] - https://gerrit.wikimedia.org/r/694629 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[21:37:41] (CR) Dzahn: "27 requests sent to miscweb1002.eqiad.wmnet. 1 request with failed assertions. But the other 26 are fine" [puppet] - https://gerrit.wikimedia.org/r/748213 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[21:47:30] SRE, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) switched service from lvs_setup to monitoring_setup. new Icinga checks confirmed and working for both DCs https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=...
[21:54:50] (PS1) Dzahn: add static-tendril.wikimedia.org ServerName to its apache site [puppet] - https://gerrit.wikimedia.org/r/748216 (https://phabricator.wikimedia.org/T297605)
[21:55:55] (PS2) Dzahn: add static-tendril.wikimedia.org ServerName to its apache site [puppet] - https://gerrit.wikimedia.org/r/748216 (https://phabricator.wikimedia.org/T297605)
[22:02:24] (PS1) Dzahn: static-tendril: give apache site the name matching the virtual host [puppet] - https://gerrit.wikimedia.org/r/748217 (https://phabricator.wikimedia.org/T297605)
[22:10:04] (PS1) Dzahn: httpbb: add/fix tests for dbtree/tendril/static-tendril on miscweb [puppet] - https://gerrit.wikimedia.org/r/748218 (https://phabricator.wikimedia.org/T297605)
[22:10:24] (CR) Dzahn: [C: +2] add static-tendril.wikimedia.org ServerName to its apache site [puppet] - https://gerrit.wikimedia.org/r/748216 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[22:23:05] (CR) JHathaway: mirrors.wikimedia.org: point to new mirror (1 comment) [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: JHathaway)
[22:23:52] (CR) Dzahn: [C: +2] static-tendril: give apache site the name matching the virtual host [puppet] - https://gerrit.wikimedia.org/r/748217 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[22:24:44] (CR) Dzahn: [C: +2] httpbb: add/fix tests for dbtree/tendril/static-tendril on miscweb [puppet] - https://gerrit.wikimedia.org/r/748218 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[22:24:55] (CR) Dzahn: "testing with live editing" [puppet] - https://gerrit.wikimedia.org/r/748218 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[22:27:13] (CR) Dzahn: "PASS: 29 requests sent to miscweb2002.codfw.wmnet. All assertions passed." [puppet] - https://gerrit.wikimedia.org/r/748218 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[22:30:57] !log bblack@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus6001.drmrs.wmnet
[22:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:22] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/748222
[22:35:32] (CR) Dzahn: "not sure if needed in this case but just wanted to put out there the option to first lower TTL from 1H to 5M to be able to switch back fas" [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: JHathaway)
[22:54:13] (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/748222 (owner: PipelineBot)
[22:57:40] (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/748222 (owner: PipelineBot)
[22:59:53] (PS2) JHathaway: mirrors.wikimedia.org: point to new mirror [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898)
[23:01:07] (CR) JHathaway: mirrors.wikimedia.org: point to new mirror (1 comment) [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: JHathaway)
[23:07:00] !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[23:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:58] !log Testing T297987
[23:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:04] T297987: Test - https://phabricator.wikimedia.org/T297987
[23:12:09] !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[23:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:56] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting shell access for Brian King - https://phabricator.wikimedia.org/T297910 (RKemper) In progress→Resolved
[23:13:01] !log T297910 foobar testing 1 2 3
[23:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:39] !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[23:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:41] !log T297986 Beep boop testing 1 2 3 disregard me
[23:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:45] T297986: [Tracking task] Pair with brian king on various operational tasks - https://phabricator.wikimedia.org/T297986
[23:14:52] hey that last part rhymes
[23:21:10] (PS1) BBlack: prometheus6001: macaddr and site.pp [puppet] - https://gerrit.wikimedia.org/r/748224 (https://phabricator.wikimedia.org/T282787)
[23:21:12] (PS1) BBlack: prometheus6001: add to global node list [puppet] - https://gerrit.wikimedia.org/r/748225 (https://phabricator.wikimedia.org/T282787)
[23:22:22] (CR) BBlack: [C: +2] prometheus6001: macaddr and site.pp [puppet] - https://gerrit.wikimedia.org/r/748224 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[23:43:59] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook