[00:00:00] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1001/31966/" [puppet] - https://gerrit.wikimedia.org/r/734798 (https://phabricator.wikimedia.org/T294378) (owner: Dzahn)
[00:00:04] twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T0000).
[00:00:11] I mean, as a policy matter we don't allow appservers to talk to wmcloud
[00:01:05] AIUI this was discussed with SRE/Security and they decided to make an exception
[00:01:18] did anyone tell the infrastructure that?
[00:01:18] or maybe relax that policy more generally, not sure
[00:03:42] you can probably get away with using $wmfLocalServices['urldownloader'] as the proxy
[00:04:15] but this is the first I'm hearing about this
[00:04:19] Thanks, I'll look into it. Thanks for the deploy RoanKattouw! not sure if it's worth reverting for now, it's testwiki-only.
[00:06:35] legoktm: I think the discussion was between Seve, akosiaris and John Bennett.
[00:06:41] or at least included them.
[00:06:53] thanks, I'll ask
[00:07:05] should I announce / propose it somewhere?
[00:08:29] not sure, I'm checking with Alex if this is something we already decided to support
[00:08:49] it seems like a bad idea to me for all the reasons we already don't depend on wmcloud stuff in production, but :shrug:
[00:09:42] maybe start a thread on good old ops mailing list
[00:10:30] AIUI the argument was that there should be a way to do product validation (an A/B test in this case) before doing the work of productizing a new service
[00:13:12] anyway, let me know if I can help with the decision / discussion somehow. I'll see if it works with urldownloader and add you for review.
[00:14:01] ack
[00:17:54] (PS2) Dzahn: service/miscweb: switch state from monitoring_setup to production [puppet] - https://gerrit.wikimedia.org/r/694630 (https://phabricator.wikimedia.org/T281538)
[00:18:48] (CR) jerkins-bot: [V: -1] service/miscweb: switch state from monitoring_setup to production [puppet] - https://gerrit.wikimedia.org/r/694630 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[00:33:56] (PS1) Dzahn: hieradata/hosts: removing non-existent host files [puppet] - https://gerrit.wikimedia.org/r/735082
[00:38:00] (PS3) Dzahn: add miscweb to LVS [puppet] - https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538)
[00:44:11] (CR) Dzahn: "@Legoktm Could you please take a look here some time? This is very similar to when you added Toolhub to LVS at https://gerrit.wikimedia.or" [puppet] - https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[00:47:05] (CR) Dzahn: "should discovery DNS before or after LVS config? besides this one I have 4 other changes.
one for discovery DNS and 3 for the LVS "stages"" [puppet] - https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:12:58] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:59:32] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 173 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:05:32] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 40 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:15:32] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 227 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:21:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 42 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:29:51] (PS1) AntiCompositeNumber: wikireplicas: add Translate extension tables [puppet] - https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952)
[03:40:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 264 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:46:28] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 40 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:25:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17622 and previous config saved to /var/cache/conftool/dbconfig/20211028-050052-marostegui.json
[05:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:01] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[05:09:12] (PS1) Marostegui: production-m5.sql: Add striker GRANTS [puppet] - https://gerrit.wikimedia.org/r/735092 (https://phabricator.wikimedia.org/T288093)
[05:09:52] (CR) Marostegui: [C: +2] production-m5.sql: Add striker GRANTS [puppet] - https://gerrit.wikimedia.org/r/735092 (https://phabricator.wikimedia.org/T288093) (owner: Marostegui)
[05:19:39] SRE, Commons, DBA, MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Marostegui) I will talk to @Ladsgroup about this, as I am missing lots of context here and the implications this could have.
[05:25:12] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations: Puppet failing on deployment-docker-* hosts - https://phabricator.wikimedia.org/T294517 (Majavah) p:Triage→High
[05:26:07] SRE, Analytics, Analytics-Kanban, Data-Engineering, and 3 others: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (odimitrijevic)
[05:39:44] SRE, Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (odimitrijevic)
[05:53:02] (PS1) Gergő Tisza: Use url-downloader proxy for GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/735094 (https://phabricator.wikimedia.org/T290949)
[06:14:23] (PS2) Ayounsi: Remove GRE tunnel between cr4-ulsfo and cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308)
[06:15:28] (CR) Ayounsi: [C: +2] Remove GRE tunnel between cr4-ulsfo and cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308) (owner: Ayounsi)
[06:16:07] (Merged) jenkins-bot: Remove GRE tunnel between cr4-ulsfo and cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308) (owner: Ayounsi)
[06:17:18] !log Remove GRE tunnel between cr4-ulsfo and cr2-eqsin - T273308
[06:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:09] SRE, Traffic, observability, Discovery-Search (Current work): flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (elukey) Another option could be to use httpd from buster-backports, but https://packages.debian.org/buster-backports/apache2 s...
[06:34:43] (PS1) Ayounsi: Remove includes for eqsin-ulsfo GRE tunnel prefix [dns] - https://gerrit.wikimedia.org/r/735293 (https://phabricator.wikimedia.org/T273308)
[06:38:47] !log depool cp5011 and restart varnish-frontend (ABI errors while reloading after digicert changes)
[06:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:41] SRE, DBA, Wikimedia-Mailing-lists, Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (Marostegui) @Legoktm did you get to test this in cloud vps?
[06:41:18] PROBLEM - Confd vcl based reload on cp5011 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[06:42:14] mmm this seems to be a stale monitor, should solve soon (host still depooled)
[06:49:14] Juan_90264: thanks for the deploy: sorry I wasn't around, i expected it later today ^_^
[06:49:33] (PS1) Elukey: profile::pki::root_ca: add new kafka intermediate CA [puppet] - https://gerrit.wikimedia.org/r/735294 (https://phabricator.wikimedia.org/T291905)
[07:11:30] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:19:38] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:29] (CR) Ema: [C: +2] varnishreqstats.mtail: remove wildcard match [puppet] - https://gerrit.wikimedia.org/r/734970 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[07:26:42] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:32:19] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector:
name=cp5011.eqsin.wmnet
[07:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:20] RECOVERY - Confd vcl based reload on cp5011 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[07:57:07] (CR) Volans: [C: +1] "LGTM, optional question inline" [dns] - https://gerrit.wikimedia.org/r/735293 (https://phabricator.wikimedia.org/T273308) (owner: Ayounsi)
[08:07:43] SRE, Observability-Logging, Traffic, Patch-For-Review, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) The optimizations to varnishxcache.mtail and varnishreqstats.mtail paid off, time spent in `tryBacktrack`...
[08:20:46] !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[08:20:46] (PS1) Vgutierrez: acme_chief: Page on acme-chief unit failure [puppet] - https://gerrit.wikimedia.org/r/735297 (https://phabricator.wikimedia.org/T292619)
[08:20:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[08:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:54] jouncebot: next
[08:21:54] In 0 hour(s) and 8 minute(s): Create new wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T0830)
[08:25:40] (PS2) Urbanecm: Initial configuration for pwnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/733142 (https://phabricator.wikimedia.org/T292415)
[08:25:49] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31967/console" [puppet] - https://gerrit.wikimedia.org/r/735297 (https://phabricator.wikimedia.org/T292619) (owner: Vgutierrez)
[08:25:56] (PS2) Urbanecm: Initial configuration for amiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/733143
(https://phabricator.wikimedia.org/T292414)
[08:26:10] (PS2) Urbanecm: Initial configuration for lmowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/733145 (https://phabricator.wikimedia.org/T291390)
[08:26:57] (PS3) Urbanecm: Initial configuration for lmowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/733145 (https://phabricator.wikimedia.org/T291390)
[08:29:22] (CR) Jelto: [C: -1] "gitlab::backup_dir_data and gitlab::backup_dir_config are used here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ref" [puppet] - https://gerrit.wikimedia.org/r/735016 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[08:30:04] Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Create new wikis . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T0830).
[08:30:15] o/
[08:30:21] the wonderful time of year again :)
[08:30:32] (CR) Urbanecm: [C: +2] Initial configuration for pwnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/733142 (https://phabricator.wikimedia.org/T292415) (owner: Urbanecm)
[08:30:42] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:55] !log A:cp start rolling varnish upgrades to 6.0.8-1wm2 T293879
[08:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:02] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[08:31:20] (Merged) jenkins-bot: Initial configuration for pwnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/733142 (https://phabricator.wikimedia.org/T292415) (owner: Urbanecm)
[08:31:57] (PS1) Urbanecm: lmowiktionary: Create Appendix namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/735299 (https://phabricator.wikimedia.org/T291390)
[08:32:34] (CR) jerkins-bot: [V: -1] lmowiktionary: Create Appendix namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/735299 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[08:34:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:08] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:37:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:37:20] !log urbanecm@deploy1002 Synchronized wmf-config/db-eqiad.php: Creating pwnwiki (T292415) (duration: 01m 03s)
[08:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:29] T292415: Create Wikipedia Paiwan - https://phabricator.wikimedia.org/T292415
[08:38:39] !log urbanecm@deploy1002 Synchronized wmf-config/db-codfw.php: Creating pwnwiki (T292415) (duration: 01m 03s)
[08:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:56] (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31968/console" [puppet] - https://gerrit.wikimedia.org/r/734986 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[08:39:17] zabe: lol, apparently I'm not the only one trying to collect low user ids on new wikis :D
[08:39:30] majavah: exactly my thoughts :D
[08:39:40] definitely :D
[08:39:43] !log urbanecm@deploy1002 Synchronized dblists: Creating pwnwiki (T292415) (duration: 01m 02s)
[08:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:55] (CR) David Caro: [V:
+1 C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/734986 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[08:40:02] no one can beat Maintenance script though :D
[08:40:32] (CR) Urbanecm: [C: +2] Initial configuration for amiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/733143 (https://phabricator.wikimedia.org/T292414) (owner: Urbanecm)
[08:40:34] (CR) David Caro: [C: +1] P:wmcs::monitoring: fix typo [puppet] - https://gerrit.wikimedia.org/r/735027 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[08:40:55] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating pwnwiki (T292415)
[08:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:27] (Merged) jenkins-bot: Initial configuration for amiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/733143 (https://phabricator.wikimedia.org/T292414) (owner: Urbanecm)
[08:41:31] majavah: i'm here too
[08:41:58] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating pwnwiki (T292415) (duration: 01m 03s)
[08:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:19] Spookreeeno: ah, I didn't see you since I didn't bother checking IDs above mine
[08:42:30] (no replicas yet, so forms which accept user ids are the only way to check at this point)
[08:42:52] * Spookreeeno is bored waiting for scp
[08:43:01] !log urbanecm@deploy1002 Synchronized langlist: Creating pwnwiki (T292415) (duration: 01m 02s)
[08:43:05] * urbanecm is wondering how is scp relevant
[08:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:06] T292415: Create Wikipedia Paiwan - https://phabricator.wikimedia.org/T292415
[08:43:28] because otherwise i'd have things to do other than get low user ids on new wikis
[08:46:24] majavah: congratulations to user_id=3
[08:46:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync'
command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:05] !log urbanecm@deploy1002 Synchronized wmf-config/db-eqiad.php: Creating amiwiki (T292414) (duration: 01m 02s)
[08:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:10] T292414: Create Wikipedia Amis - https://phabricator.wikimedia.org/T292414
[08:47:31] (PS2) Vgutierrez: cache: Provide a HAproxy upload role [puppet] - https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005)
[08:47:33] (PS2) Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005)
[08:47:35] (PS1) Vgutierrez: cache::haproxy: Remove deprecated require_package [puppet] - https://gerrit.wikimedia.org/r/735302
[08:48:08] !log urbanecm@deploy1002 Synchronized wmf-config/db-codfw.php: Creating amiwiki (T292414) (duration: 01m 02s)
[08:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:28] hahahaha :D
[08:49:11] !log urbanecm@deploy1002 Synchronized dblists: Creating amiwiki (T292414) (duration: 01m 02s)
[08:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:34] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=pwnwiki --cluster=all # T292415
[08:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:40] T292415: Create Wikipedia Paiwan - https://phabricator.wikimedia.org/T292415
[08:50:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:22] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating amiwiki (T292414)
[08:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:38] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=amiwiki --cluster=all # T292414
[08:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating amiwiki (T292414) (duration: 01m 02s)
[08:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:49] (CR) Urbanecm: [C: +2] Initial configuration for lmowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/733145 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[08:52:27] !log urbanecm@deploy1002 Synchronized langlist: Creating amiwiki (T292414) (duration: 01m 02s)
[08:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:34] T292414: Create Wikipedia Amis - https://phabricator.wikimedia.org/T292414
[08:52:40] (Merged) jenkins-bot: Initial configuration for lmowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/733145 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[08:52:46] just in time
[08:53:49] (CR) David Caro: "Thanks Majavah!" [puppet] - https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) (owner: Majavah)
[08:58:42] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:59:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:51] (PS1) Urbanecm: lmowiktionary: Add missing rules in db-* files [mediawiki-config] - https://gerrit.wikimedia.org/r/735303 (https://phabricator.wikimedia.org/T291390)
[08:59:55] i hate it when this happens
[09:00:02] fortunately, DB creation happened in the right place
[09:00:11] (CR) Urbanecm: [C: +2] lmowiktionary: Add missing rules in db-* files [mediawiki-config] - https://gerrit.wikimedia.org/r/735303 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[09:00:40] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:00:53] (Merged) jenkins-bot: lmowiktionary: Add missing rules in db-* files [mediawiki-config] - https://gerrit.wikimedia.org/r/735303 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[09:02:48] urbanecm: this time majavah beat you
[09:02:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:01] zabe: urbanecm: let's play a game :D https://phabricator.wikimedia.org/P17624
[09:07:09] !log urbanecm@deploy1002 Synchronized wmf-config/db-eqiad.php: Creating lmowiktionary (T291390) (duration: 01m 02s)
[09:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:15] T291390: Create Wiktionary Lombard - https://phabricator.wikimedia.org/T291390
[09:07:30] nice way of working around "no replication" :D
[09:08:12] !log urbanecm@deploy1002 Synchronized wmf-config/db-codfw.php: Creating lmowiktionary (T291390) (duration: 01m 02s)
[09:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:04] (PS3) Vgutierrez: cache: Provide a HAproxy upload role [puppet] - https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005)
[09:09:06] (PS3) Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005)
[09:09:08] (PS1) Vgutierrez: profile::haproxy: Fix typo on X-Analytics-TLS value [puppet] - https://gerrit.wikimedia.org/r/735305 (https://phabricator.wikimedia.org/T290005)
[09:09:14] !log urbanecm@deploy1002 Synchronized dblists: Creating lmowiktionary (T291390) (duration: 01m 02s)
[09:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:25] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating lmowiktionary (T291390)
[09:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:28] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating lmowiktionary (T291390) (duration: 01m 02s)
[09:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release
'pinkunicorn' .
[09:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:30] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating lmowiktionary (T291390) (duration: 01m 02s)
[09:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:36] T291390: Create Wiktionary Lombard - https://phabricator.wikimedia.org/T291390
[09:13:33] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=lmowiktionary --cluster=all # T291390
[09:13:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating lmowiktionary (T291390) (duration: 01m 02s)
[09:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:53] (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/735307
[09:13:55] (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/735307 (owner: Urbanecm)
[09:14:47] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/735307 (owner: Urbanecm)
[09:15:30] (PS2) Urbanecm: lmowiktionary: Create Appendix namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/735299 (https://phabricator.wikimedia.org/T291390)
[09:15:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:54] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 06s)
[09:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:23] (CR) jerkins-bot: [V: -1] lmowiktionary: Create Appendix namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/735299 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[09:16:36] (PS3) Urbanecm: lmowiktionary: Create Appendix namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/735299 (https://phabricator.wikimedia.org/T291390)
[09:16:55] (CR) Urbanecm: [C: +2] lmowiktionary: Create Appendix namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/735299 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[09:17:51] (Merged) jenkins-bot: lmowiktionary: Create Appendix namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/735299 (https://phabricator.wikimedia.org/T291390) (owner: Urbanecm)
[09:19:12] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 532f8e5d476a1e2d5d953371a07f0aeb8c01bbaf: lmowiktionary: Create Appendix namespace (T291390) (duration: 01m 03s)
[09:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:18] T291390: Create Wiktionary Lombard - https://phabricator.wikimedia.org/T291390
[09:19:41] !log Wiki creation done. pwnwiki, amiwiki and lmowiktionary got created.
[09:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:22] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:25:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:22] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:28:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:12] (CR) Vgutierrez: [C: +2] cache::haproxy: Remove deprecated require_package [puppet] - https://gerrit.wikimedia.org/r/735302 (owner: Vgutierrez)
[09:32:55] (PS1) QChris: Add .gitreview [software/wmfdb] - https://gerrit.wikimedia.org/r/735309
[09:32:57] (CR) QChris: [V: +2 C: +2] Add .gitreview [software/wmfdb] - https://gerrit.wikimedia.org/r/735309 (owner: QChris)
[09:33:03] (CR) Vgutierrez: [C: +2] profile::haproxy: Fix typo on X-Analytics-TLS value [puppet] - https://gerrit.wikimedia.org/r/735305 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:50:01] SRE, ops-eqiad, Sustainability (Incident Followup): eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (cmooney) Thanks for the run down Arzhel. Unfortunate incident, easily understood in hindsight but a quirky edge case - I can understand how we overlooked the potential for it...
[09:50:59] (CR) Jbond: C:gitlab: drop undefined variables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/735016 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[10:00:05] mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1000). Please do the needful.
[10:02:59] (CR) Btullis: [C: +1] "Looks OK to me. I take it that the one compilation failure from the PCC run was unrelated to the change?"
[puppet] - https://gerrit.wikimedia.org/r/735023 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[10:03:16] (CR) JMeybohm: [C: -1] blubberoid: bump common_templates to 0.4 and chart version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/734926 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[10:06:40] (CR) Btullis: [C: +1] C:bigtop::hue: ensure all variables are deifned [puppet] - https://gerrit.wikimedia.org/r/735028 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[10:10:31] (CR) Jelto: blubberoid: bump common_templates to 0.4 and chart version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/734926 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[10:17:20] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31969/console" [puppet] - https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[10:17:51] (CR) JMeybohm: [C: -1] blubberoid: bump common_templates to 0.4 and chart version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/734926 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[10:25:21] (CR) Btullis: [V: +1] C:statistics::compute: correct user param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/735029 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[10:25:42] (PS2) Jelto: blubberoid: bump common_templates to 0.4 and chart version [deployment-charts] - https://gerrit.wikimedia.org/r/734926 (https://phabricator.wikimedia.org/T292390)
[10:29:36] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:32:44] (CR) Jelto: blubberoid: bump common_templates to 0.4 and chart version (1 comment) [deployment-charts] -
10https://gerrit.wikimedia.org/r/734926 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [10:33:05] (03CR) 10Btullis: "Wikitech mentions a manual process for creating the new intermediate:" [puppet] - 10https://gerrit.wikimedia.org/r/735294 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [10:33:54] (03PS1) 10Jbond: README: update build local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735317 [10:36:49] (03CR) 10Elukey: profile::pki::root_ca: add new kafka intermediate CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735294 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [10:46:52] (03PS1) 10David Caro: general: some dev-related improvements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735319 [10:54:09] (03CR) 10jerkins-bot: [V: 04-1] README: update build local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735317 (owner: 10Jbond) [11:00:04] Amir1, Lucas_WMDE, and apergos: Dear deployers, time to do the UTC morning backport and config training deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1100). [11:00:04] inductiveload: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:15] o/ [11:00:18] vaca [11:00:23] see you next week [11:00:30] (bank holiday) [11:00:46] it's OK, the patches went out already [11:00:50] yup, just saw that [11:00:54] enjoy your holiday :-) [11:01:00] nothing to do then [11:01:04] yep [11:04:29] (03PS4) 10Vgutierrez: cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) [11:04:31] (03PS4) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [11:04:33] (03PS1) 10Vgutierrez: haproxy: Allow setting variables [puppet] - 10https://gerrit.wikimedia.org/r/735324 (https://phabricator.wikimedia.org/T290005) [11:04:35] (03PS1) 10Vgutierrez: cache::haproxy: Fix missing_xwd ACL [puppet] - 10https://gerrit.wikimedia.org/r/735325 (https://phabricator.wikimedia.org/T290005) [11:05:03] !log volans@cumin2002 START - Cookbook sre.hosts.ipmi-password-reset [11:05:03] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [11:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:16] !log volans@cumin2002 START - Cookbook sre.hosts.ipmi-password-reset [11:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] !log volans@cumin2002 Updating IPMI password on 0 hosts - volans@cumin2002 - T283050 [11:06:02] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [11:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:07] T283050: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 [11:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:33] (03PS1) 10Zabe: Add permissions to eleminators on viwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/735347 (https://phabricator.wikimedia.org/T294530) [11:16:53] (03CR) 10Urbanecm: [C: 03+1] "LGTM, one nit inline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735347 (https://phabricator.wikimedia.org/T294530) (owner: 10Zabe) [11:18:18] (03PS2) 10Zabe: Add permissions to eleminators on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735347 (https://phabricator.wikimedia.org/T294530) [11:18:43] (03CR) 10Zabe: Add permissions to eleminators on viwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735347 (https://phabricator.wikimedia.org/T294530) (owner: 10Zabe) [11:19:05] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735347 (https://phabricator.wikimedia.org/T294530) (owner: 10Zabe) [11:19:07] thanks zabe [11:20:01] if some deployer has time we can do that one ^ [11:20:06] (03PS1) 10Volans: sre.hosts.ipmi-password-reset: support new hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/735360 [11:21:27] (03CR) 10Volans: "I've done the minimal changes to get this working for new hardware. We should refactor this a bit later to use the new class API and simpl" [cookbooks] - 10https://gerrit.wikimedia.org/r/735360 (owner: 10Volans) [11:23:29] jouncebot: now [11:23:29] For the next 0 hour(s) and 36 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1100) [11:23:50] Lucas_WMDE: double checking if i can deploy stuff now? [11:24:32] (03CR) 10Cathal Mooney: [C: 03+1] "Following the code and trying to see what it's doing the changes makes sense to me. 
But probably someone with better knowledge of this sh" [cookbooks] - 10https://gerrit.wikimedia.org/r/735360 (owner: 10Volans) [11:25:07] urbanecm: go ahead [11:25:10] thanks [11:25:13] (03CR) 10Urbanecm: [C: 03+2] Add permissions to eleminators on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735347 (https://phabricator.wikimedia.org/T294530) (owner: 10Zabe) [11:26:06] (03Merged) 10jenkins-bot: Add permissions to eleminators on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735347 (https://phabricator.wikimedia.org/T294530) (owner: 10Zabe) [11:26:36] zabe: it's at mwdebug1001 [11:27:31] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.ipmi-password-reset: support new hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/735360 (owner: 10Volans) [11:28:05] urbanecm: lgtm [11:28:19] syncing [11:29:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:40] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 77df86abd173cec7f1a87c85fc65ce5faca329cb: Add permissions to eleminators on viwiki (T294530) (duration: 01m 04s) [11:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:46] T294530: New rights for vi.wiki eliminators - https://phabricator.wikimedia.org/T294530 [11:29:47] zabe: done. Anything else? [11:30:23] no, thanks :) [11:30:26] np [11:32:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:58] (03CR) 10Volans: [C: 03+2] sre.hosts.ipmi-password-reset: support new hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/735360 (owner: 10Volans) [11:35:24] Is FlaggedRevs really broken again... 
T294544 [11:35:24] T294544: FlaggedRevs does not work in german wiktionary - https://phabricator.wikimedia.org/T294544 [11:35:34] (03CR) 10Kosta Harlan: [C: 03+1] Use url-downloader proxy for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735094 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [11:36:26] (03Merged) 10jenkins-bot: sre.hosts.ipmi-password-reset: support new hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/735360 (owner: 10Volans) [11:37:14] zabe: everyone's favourite extension to block the train on :D [11:37:28] !log volans@cumin2002 START - Cookbook sre.hosts.ipmi-password-reset [11:37:29] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=1) [11:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:57] definetly, it always makes fun :D [11:39:03] !log volans@cumin2002 START - Cookbook sre.hosts.ipmi-password-reset [11:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:49] !log volans@cumin2002 Updating IPMI password on 1 hosts - volans@cumin2002 - T283050 [11:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:55] T283050: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 [11:40:31] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [11:40:32] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:36] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:45:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:13] (03CR) 10Jbond: [C: 03+2] P:wmcs::monitoring: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/735027 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [11:49:14] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735317 (owner: 10Jbond) [11:49:18] (03CR) 10Jbond: [C: 03+2] README: update build local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735317 (owner: 10Jbond) [11:51:16] (03CR) 10Jbond: [C: 03+2] bigtop::mysql_jdbc: ensure package_name variable is always defined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735023 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [11:51:40] (03CR) 10Jbond: [C: 03+2] C:bigtop::hue: ensure all variables are deifned [puppet] - 10https://gerrit.wikimedia.org/r/735028 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [11:53:29] (03PS1) 10Lucas Werkmeister (WMDE): Load Wikibase Client before other Wikibase extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735367 (https://phabricator.wikimedia.org/T294224) [11:54:38] (03PS5) 10Urbanecm: Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) [11:55:49] urbanecm: \o/ [11:56:00] (re foundationwiki sul( [11:56:10] hi majavah! 
Just waiting for jouncebot to announce start of the window :) [11:56:57] (03CR) 10Jbond: general: some dev-related improvements (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735319 (owner: 10David Caro) [11:58:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/735294 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:59:08] majavah: my plan is to merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/717506, pull to mwmaint1002, run extensions/CentralAuth/maintenance/migratePass0.php, pull to mwdebug1001 and then make sure i can login via my local foundationwiki credentials. [11:59:27] does that look good to you? [11:59:42] (03CR) 10Ema: puppetboard: add puppetboard as an active/active service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/734263 (owner: 10Jbond) [11:59:54] pass0 is the one that imports localnames and globalnames tables? [12:00:04] Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do foundation.wikimedia.org SUL migration deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1200). 
[12:00:09] o/ [12:00:24] majavah: yeah, as stated at https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/maintenance/migratePass0.php [12:00:37] sounds good in theory [12:00:40] (03CR) 10Ema: [C: 03+1] puppetboard: add puppetboard as an active/active service [dns] - 10https://gerrit.wikimedia.org/r/734262 (owner: 10Jbond) [12:00:43] let's see what happens in practice :P [12:00:46] that's at least how i did it at beta :D [12:00:54] so yeah, let's try [12:00:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/735082 (owner: 10Dzahn) [12:01:04] (03CR) 10Urbanecm: [C: 03+2] Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [12:02:04] (03Merged) 10jenkins-bot: Connect foundationwiki to SUL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [12:02:41] !log advertise esams prefix to NaWas - T288505 [12:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:19] !log urbanecm@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/migratePass0.php --wiki=foundationwiki # T205347, with 717506 pulled to mwmaint1002 [12:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:25] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [12:03:43] except newlines, it looks to work correctly: https://phabricator.wikimedia.org/P17625 [12:04:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:55] `wikiadmin@10.64.0.97(centralauth)> select count(*) from localnames where ln_wiki='foundationwiki';` says 1084, and so does `wikiadmin@10.64.32.82(foundationwiki)> select count(*) from user;` [12:05:11] so, pulling to mwdebug1001 [12:05:34] logging in works [12:05:39] trying Special:MergeAccount [12:05:53] oh, i didn't enable xwikimediadebug [12:06:32] new local user creation seems to work [12:06:56] i can login via xwikimediadebug too [12:07:20] logging in non-inkognito window throws "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form.", but that's likely due to some cookies mismatches [12:07:30] as long as it works in an inkognito window, I'm happy [12:07:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:37] this looks good to me https://www.irccloud.com/pastebin/h4YNENLt/ [12:08:52] i don't think there's more i can check via debug. 
[12:08:57] (03PS1) 10Jbond: P:trafficserver::monitoring: include dependent class profile::moniotring [puppet] - 10https://gerrit.wikimedia.org/r/735370 [12:08:59] so...unless majavah has other ideas, I'll sync [12:09:15] Special:CentralAuth/Martin_Urbanec on meta does not list your account [12:09:27] (03CR) 10jerkins-bot: [V: 04-1] README: update build local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735317 (owner: 10Jbond) [12:09:36] majavah: but https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec_(WMF) does [12:09:50] I don't have a volunteer acc there, only a staff one [12:10:03] ah, right [12:10:10] lgtm then [12:10:13] syncing [12:11:05] foundationwiki also lets me to use permissions coming from my +staff flag [12:11:09] so that part also works [12:11:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6f557db4ad82aa3ec80550a423d538099bf305fa: Connect foundationwiki to SUL (T205347) (duration: 01m 03s) [12:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:47] and....live [12:11:48] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [12:11:59] majavah: just to double check, you can't actually edit mainspace, correct? [12:12:31] "You do not have permission to edit pages in the Page namespace." [12:12:38] sounds good [12:12:42] urbanecm: don't you need to sync commonssettings? 
[12:12:52] (03CR) 10Vgutierrez: [C: 03+1] P:trafficserver::monitoring: include dependent class profile::moniotring [puppet] - 10https://gerrit.wikimedia.org/r/735370 (owner: 10Jbond) [12:13:01] https://foundation.wikimedia.org/wiki/MediaWiki:Badaccess-groups is empty, which makes some the message confusing though [12:13:01] zabe: good point, too used to syncing IS.php changes [12:13:24] majavah: yup, we'll deal with that later [12:14:06] now just let's do the same on wikitech :P [12:14:27] majavah: first get rid of wikitech being source of LDAP (in some ways) :D [12:14:42] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 6f557db4ad82aa3ec80550a423d538099bf305fa: Connect foundationwiki to SUL (T205347; 1/3) (duration: 01m 03s) [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:57] yeah [12:15:26] (03PS5) 10Jbond: puppetboard: add puppetboard as an active/active service [puppet] - 10https://gerrit.wikimedia.org/r/734263 [12:16:26] !log urbanecm@deploy1002 Synchronized dblists/fishbowl.dblist: 6f557db4ad82aa3ec80550a423d538099bf305fa: Connect foundationwiki to SUL (T205347; 2/3) (duration: 01m 03s) [12:16:27] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/734263 (owner: 10Jbond) [12:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:00] volunteer account now got autocreated too [12:17:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:17:30] !log urbanecm@deploy1002 Synchronized wmf-config/config/foundationwiki.yaml: 6f557db4ad82aa3ec80550a423d538099bf305fa: Connect foundationwiki to SUL (T205347; 3/3) (duration: 01m 02s) [12:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:38] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [12:18:24] majavah: I can't get rid of the feeling "this was way too simple" [12:20:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:35] heh :D [12:21:58] * urbanecm goes to migrate some blocked accounts that don't exist in SUL [12:25:43] 10SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (10fgiunchedi) [12:25:55] !log prepend our AS on all es/knams uplinks (except NaWas) - T288505 [12:25:58] [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=foundationwiki --safe --auto --userlist users2.txt works like a charm with a single account [12:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:06] let's run it on all of them [12:26:28] (03PS1) 10Cathal Mooney: Field descriptions for tcp_mss_clamping and prepend_as_out were reversed for some reason. Just swapping them around to fix. 
[homer/public] - 10https://gerrit.wikimedia.org/r/735372 [12:27:13] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=foundationwiki --safe --auto --userlist users.txt # T205347, users.txt is P17626 [12:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:19] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [12:27:20] 10SRE-swift-storage: Decom ms-be10[28-39] - https://phabricator.wikimedia.org/T294550 (10fgiunchedi) [12:28:39] (03CR) 10Cathal Mooney: [C: 03+2] Field descriptions for tcp_mss_clamping and prepend_as_out were reversed for some reason. Just swapping them around to fix. [homer/public] - 10https://gerrit.wikimedia.org/r/735372 (owner: 10Cathal Mooney) [12:29:12] (03Merged) 10jenkins-bot: Field descriptions for tcp_mss_clamping and prepend_as_out were reversed for some reason. Just swapping them around to fix. [homer/public] - 10https://gerrit.wikimedia.org/r/735372 (owner: 10Cathal Mooney) [12:29:35] (03PS1) 10Arturo Borrero Gonzalez: toolforge: ingress: disable snippet annotations [puppet] - 10https://gerrit.wikimedia.org/r/735373 (https://phabricator.wikimedia.org/T294330) [12:29:49] Did someone decide wether foundationwiki should be added to the global sysop opt out wiki set? [12:30:40] zabe: i don't think anyone officially decided that yet. 
But I'll make that decision, it definitely should be opted-out [12:30:52] done: https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=44128071 [12:30:56] I thought so [12:31:45] (03PS2) 10Arturo Borrero Gonzalez: toolforge: ingress: disable snippet annotations [puppet] - 10https://gerrit.wikimedia.org/r/735373 (https://phabricator.wikimedia.org/T294330) [12:32:10] thanks for raising that up zabe [12:32:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: ingress: disable snippet annotations [puppet] - 10https://gerrit.wikimedia.org/r/735373 (https://phabricator.wikimedia.org/T294330) (owner: 10Arturo Borrero Gonzalez) [12:32:52] (03CR) 10JMeybohm: [C: 03+1] blubberoid: bump common_templates to 0.4 and chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/734926 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [12:35:17] !log [urbanecm@mwmaint1002 ~/foundationwiki-sul]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=foundationwiki --userlist users.txt # T205347; users.txt is at P17627 [12:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:24] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [12:39:01] urbanecm: can you ping xSavitar when you done? 
[12:39:09] He got a train blocker [12:39:21] Spookreeeno: he can feel free to deploy now [12:39:27] i'm just mwmaint'ing now [12:39:30] xSavitar: ^ [12:39:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31976/console" [puppet] - 10https://gerrit.wikimedia.org/r/734263 (owner: 10Jbond) [12:39:51] (03CR) 10Jbond: [C: 03+2] P:trafficserver::monitoring: include dependent class profile::moniotring [puppet] - 10https://gerrit.wikimedia.org/r/735370 (owner: 10Jbond) [12:40:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [12:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:10] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [12:45:06] !log [urbanecm@mwmaint1002 ~/foundationwiki-sul]$ mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=foundationwiki --safe --auto --userlist users.txt # T205347, users.txt is at P17628 [12:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:13] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [12:46:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Cmjohnson) @Dwisehaupt I checked the switch configuration and it's set correctly, I am willing to bet the servers are plugged into the wrong switch ports.... 
[12:47:03] (03PS2) 10Vgutierrez: haproxy: Allow setting variables [puppet] - 10https://gerrit.wikimedia.org/r/735324 (https://phabricator.wikimedia.org/T290005) [12:47:05] (03PS2) 10Vgutierrez: cache::haproxy: Fix missing_xwd ACL [puppet] - 10https://gerrit.wikimedia.org/r/735325 (https://phabricator.wikimedia.org/T290005) [12:47:07] (03PS5) 10Vgutierrez: cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) [12:47:09] (03PS5) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [12:48:48] !log [urbanecm@mwmaint1002 ~/foundationwiki-sul]$ mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=foundationwiki --safe --auto --userlist users.txt # T205347, users.txt P17629 [12:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:38] (03CR) 10Michael Große: [C: 03+1] Load Wikibase Client before other Wikibase extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735367 (https://phabricator.wikimedia.org/T294224) (owner: 10Lucas Werkmeister (WMDE)) [12:51:01] okay, 318 accounts migrated (out of about a thousand) [12:51:19] (mostly inactive accounts or accounts that just needed a new global acc to be created) [12:51:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31977/console" [puppet] - 10https://gerrit.wikimedia.org/r/734986 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [13:00:04] twentyafterfour and hashar: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1300). 
[13:00:41] There's a blocker [13:00:48] But we're getting it deployed soon [13:06:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] hiera: use lookup() instead of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/734986 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [13:09:06] (03CR) 10Elukey: [C: 03+2] profile::pki::root_ca: add new kafka intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/735294 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:14:32] (03PS1) 10Jelto: C:gitlab: add default variables for backup_dir_data and backup_dir_config [puppet] - 10https://gerrit.wikimedia.org/r/735381 (https://phabricator.wikimedia.org/T294435) [13:15:34] (03PS1) 10BBlack: Remove max_core_rtt variables [puppet] - 10https://gerrit.wikimedia.org/r/735382 (https://phabricator.wikimedia.org/T241239) [13:16:03] (03PS1) 10Ema: varnishrls.mtail: various optimizations [puppet] - 10https://gerrit.wikimedia.org/r/735383 (https://phabricator.wikimedia.org/T293879) [13:16:49] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31978/console" [puppet] - 10https://gerrit.wikimedia.org/r/735381 (https://phabricator.wikimedia.org/T294435) (owner: 10Jelto) [13:17:24] (03CR) 10Jelto: C:gitlab: add default variables for backup_dir_data and backup_dir_config [puppet] - 10https://gerrit.wikimedia.org/r/735381 (https://phabricator.wikimedia.org/T294435) (owner: 10Jelto) [13:17:38] (03CR) 10BBlack: [C: 03+2] Remove max_core_rtt variables [puppet] - 10https://gerrit.wikimedia.org/r/735382 (https://phabricator.wikimedia.org/T241239) (owner: 10BBlack) [13:18:33] (03CR) 10Jelto: [C: 04-1] C:gitlab: drop undefined variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735016 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [13:18:36] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31979/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/735381 (https://phabricator.wikimedia.org/T294435) (owner: 10Jelto) [13:20:46] (03CR) 10Ema: varnishrls.mtail: various optimizations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735383 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [13:22:33] !log volans@cumin2002 START - Cookbook sre.hosts.ipmi-password-reset [13:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:02] !log volans@cumin2002 Updating IPMI password on 1 hosts - volans@cumin2002 - T283050 [13:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:08] T283050: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 [13:23:13] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [13:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:13] (03PS1) 10Elukey: profile::pki: add new kafka intermediate CA public cert [puppet] - 10https://gerrit.wikimedia.org/r/735384 (https://phabricator.wikimedia.org/T291905) [13:24:15] (03PS1) 10Elukey: role::pki::multirootca: add kafka intermediate CA config [puppet] - 10https://gerrit.wikimedia.org/r/735385 (https://phabricator.wikimedia.org/T291905) [13:24:37] (03PS6) 10Vgutierrez: cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) [13:24:39] (03PS6) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [13:24:41] (03PS1) 10Vgutierrez: cache::haproxy: Bring systemd service unit up to date [puppet] - 10https://gerrit.wikimedia.org/r/735386 (https://phabricator.wikimedia.org/T290005) [13:26:43] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/735381 (https://phabricator.wikimedia.org/T294435) (owner: 10Jelto) [13:26:56] (03Abandoned) 
10Jbond: C:gitlab: drop undefined variables [puppet] - 10https://gerrit.wikimedia.org/r/735016 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [13:31:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10DAbad) As Emil's manager, I approve this request. [13:33:51] !log rollback prepend our AS on all es/knams uplinks (except NaWas) - T288505 [13:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/735384 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:34:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/735385 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:35:07] (03CR) 10BBlack: [C: 03+2] Switch esams to digicert-2021 [puppet] - 10https://gerrit.wikimedia.org/r/735009 (https://phabricator.wikimedia.org/T289507) (owner: 10BBlack) [13:35:49] (03CR) 10Elukey: [C: 03+2] profile::pki: add new kafka intermediate CA public cert [puppet] - 10https://gerrit.wikimedia.org/r/735384 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:35:59] (03CR) 10Elukey: [C: 03+2] role::pki::multirootca: add kafka intermediate CA config [puppet] - 10https://gerrit.wikimedia.org/r/735385 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:36:01] (03PS3) 10BBlack: Switch esams to digicert-2021 [puppet] - 10https://gerrit.wikimedia.org/r/735009 (https://phabricator.wikimedia.org/T289507) [13:36:29] (03PS4) 10BBlack: Switch esams to digicert-2021 [puppet] - 10https://gerrit.wikimedia.org/r/735009 (https://phabricator.wikimedia.org/T289507) [13:36:56] (03Abandoned) 10BBlack: Switch esams to digicert-2021 [puppet] - 10https://gerrit.wikimedia.org/r/735009 (https://phabricator.wikimedia.org/T289507) (owner: 10BBlack) [13:37:06] (03PS1) 10MMandere: 
hiera::common: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/735389 (https://phabricator.wikimedia.org/T282787) [13:38:36] (03PS1) 10BBlack: Switch esams to digicert-2021 [puppet] - 10https://gerrit.wikimedia.org/r/735390 (https://phabricator.wikimedia.org/T289507) [13:39:43] (03CR) 10BBlack: [C: 03+2] Switch esams to digicert-2021 [puppet] - 10https://gerrit.wikimedia.org/r/735390 (https://phabricator.wikimedia.org/T289507) (owner: 10BBlack) [13:40:23] !log esams: switching unified TLS cert to digicert-2021 (natural rollout over next ~30 mins) [13:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:42] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [13:40:43] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [13:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:47] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with error... [13:41:29] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) The BIOS changed to continuously boot from the NIC, failing the installation. 
Fixed BIOS and attempting the install again [13:42:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [13:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:32] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [13:42:34] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [13:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:39] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with error... [13:44:22] (03CR) 10Jelto: [V: 03+1 C: 03+2] C:gitlab: add default variables for backup_dir_data and backup_dir_config [puppet] - 10https://gerrit.wikimedia.org/r/735381 (https://phabricator.wikimedia.org/T294435) (owner: 10Jelto) [13:48:09] (03PS1) 10Elukey: kserve: allow metrics scraping from any IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/735391 (https://phabricator.wikimedia.org/T289841) [13:57:15] (03CR) 10Elukey: [C: 03+2] kserve: allow metrics scraping from any IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/735391 (https://phabricator.wikimedia.org/T289841) (owner: 10Elukey) [13:59:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:42] (03PS2) 10MMandere: hiera::common: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/735389 (https://phabricator.wikimedia.org/T282787) [14:02:15] (03CR) 10BBlack: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/735389 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:04:58] !log rolling back group1 wikis to 1.38.0-wmf.5 (T293947) due to UBN T294559 [14:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:05] T293947: 1.38.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T293947 [14:05:06] T294559: Editing page ending with :number or going to such page via a namespace alias results in redirect to address with :number used as port - https://phabricator.wikimedia.org/T294559 [14:06:42] (03CR) 10MMandere: [C: 03+2] hiera::common: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/735389 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:06:45] (03PS1) 1020after4: group1 wikis to 1.38.0-wmf.5 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735393 [14:06:47] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.38.0-wmf.5 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735393 (owner: 1020after4) [14:07:38] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.5 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735393 (owner: 1020after4) [14:08:49] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.5 refs T293947 [14:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:53] !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.5 refs T293947 (duration: 01m 03s) [14:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:47] !log mwdebug-deploy@deploy1002 helmfile 
[eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:13] (03CR) 10Herron: [C: 03+2] add centrallog2002 to codfw anycast_neighbors and syslog fw allows [homer/public] - 10https://gerrit.wikimedia.org/r/731828 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [14:14:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:04] (03PS1) 10Lucas Werkmeister (WMDE): Remove tmpUseRequestLanguagesForRdfOutput Wikibase setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735394 (https://phabricator.wikimedia.org/T285795) [14:16:36] (03CR) 10Juan90264: [C: 03+1] "LGTM 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733403 (https://phabricator.wikimedia.org/T288947) (owner: 10A2093064) [14:17:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:02] (03CR) 10Filippo Giunchedi: [C: 03+1] varnishrls.mtail: various optimizations [puppet] - 10https://gerrit.wikimedia.org/r/735383 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [14:25:40] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:25:51] !log [urbanecm@mwmaint1002 ~/foundationwiki-sul]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=foundationwiki --userlist users.txt # T205347, users.txt P17630 [14:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:58] T205347: Enable SUL accounts on Governance wiki - 
https://phabricator.wikimedia.org/T205347 [14:26:58] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:53] ^ is the kubernetes1005 issue known to anyone? otherwise it sounds like the kind of thing we could've accidentally impacted with drmrs config changes :) [14:29:08] not known here... [14:29:35] it's weird that it's only one host, no? [14:31:17] jayme: could be due to staggered puppet runs, rolling out breakage I guess [14:31:30] that seems about right [14:33:54] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:37:20] ^ that might be us too [14:37:27] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10ssingh) [14:37:41] bblack: do you need something from us / some help? [14:38:04] can you remind me where on the host we actually see the reload error for icinga? [14:38:19] or do I just have to try a manual one? [14:38:46] yeah, a manual one will do it, or IIRC a systemctl reload icinga will also log in the journal [14:38:47] Oct 28 14:12:30 kubernetes1005 ferm[7754]: Starting Firewall: fermiptables-restore: line 137 failed [14:38:48] Oct 28 14:12:30 kubernetes1005 ferm[7754]: Failed to run /sbin/iptables-restore [14:38:52] bblack: ^ [14:39:14] oh, icinga. sorry :) [14:40:03] jayme: curious error though, usually those failures are temporary dns failures IME [14:40:33] yeah the icinga config one is drmrs for sure [14:40:53] although via what abstraction is hard to tell! 
[14:41:42] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-jumbo: permit centrallog2002 via ferm [puppet] - 10https://gerrit.wikimedia.org/r/732711 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [14:42:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:42:35] yeah, restarting the service does fix that situation [14:42:53] jayme: the one on k8s1005? [14:43:22] bblack: yes, on kubernetes1005 (sorry for not being clear) [14:43:30] RECOVERY - Check systemd state on kubernetes1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:43] (03PS1) 10Urbanecm: foundationwiki: Enable Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735396 (https://phabricator.wikimedia.org/T205349) [14:46:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:46:11] !log volans@cumin2002 START - Cookbook sre.hosts.ipmi-password-reset [14:46:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:23] (03CR) 10David Caro: general: some dev-related improvements (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735319 (owner: 10David Caro) [14:46:38] (03PS1) 10MMandere: hiera::drmrs Add drmrs DC site instances [puppet] - 10https://gerrit.wikimedia.org/r/735399 (https://phabricator.wikimedia.org/T282787) [14:46:42] !log volans@cumin2002 Updating IPMI password on 16 hosts - volans@cumin2002 - T283050 [14:46:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:48] T283050: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 [14:48:00] (03CR) 10Elukey: [C: 03+1] kafka-jumbo: permit centrallog2002 via ferm [puppet] - 10https://gerrit.wikimedia.org/r/732711 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [14:48:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::ci::master: remove old kubernetes config [labs/private] - 10https://gerrit.wikimedia.org/r/734065 (owner: 10Elukey) [14:48:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:48:23] (03CR) 10Elukey: [V: 03+2 C: 03+2] kubernetes: add tokens and secrets for revscoring-articlequality [labs/private] - 10https://gerrit.wikimedia.org/r/734066 (https://phabricator.wikimedia.org/T294141) (owner: 10Elukey) [14:48:42] (03CR) 10Tacsipacsi: wikireplicas: add Translate extension tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: 10AntiCompositeNumber) [14:49:08] (03PS1) 10BBlack: Remove cache_canary cluster [puppet] - 10https://gerrit.wikimedia.org/r/735400 [14:49:10] (03PS1) 10BBlack: drmrs common.yaml: comment out dc for now [puppet] - 10https://gerrit.wikimedia.org/r/735401 (https://phabricator.wikimedia.org/T282787) [14:49:17] (03PS5) 10Ssingh: Add echetty to product-users and ssh access [puppet] - 10https://gerrit.wikimedia.org/r/733916 (https://phabricator.wikimedia.org/T294229) (owner: 10RhinosF1) [14:49:32] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [14:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:43] (03CR) 10BBlack: [C: 03+2] Remove cache_canary cluster [puppet] - 10https://gerrit.wikimedia.org/r/735400 (owner: 10BBlack) [14:50:46] 
!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [14:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [14:50:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [14:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:59] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with error... 
[14:51:04] (03CR) 10BBlack: [C: 03+2] drmrs common.yaml: comment out dc for now [puppet] - 10https://gerrit.wikimedia.org/r/735401 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [14:52:32] (03CR) 10Kormat: [C: 03+1] P:mariadb::grants::production: ensure we inlucde required classes [puppet] - 10https://gerrit.wikimedia.org/r/735025 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [14:53:45] (03CR) 10Jbond: general: some dev-related improvements (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735319 (owner: 10David Caro) [14:54:02] (03Abandoned) 10Jbond: README: update build local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/735317 (owner: 10Jbond) [14:54:18] (03CR) 10Ssingh: [C: 03+2] Add echetty to product-users and ssh access [puppet] - 10https://gerrit.wikimedia.org/r/733916 (https://phabricator.wikimedia.org/T294229) (owner: 10RhinosF1) [14:56:42] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:57:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10ssingh) 05Open→03Resolved @EChetty: You should have access; please let us know if something doesn't work, thanks! (Please check your WMF e... [14:58:10] sukhe: don't forget the manual steps for Kerberos [14:58:41] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mariadb::grants::production: ensure we inlucde required classes [puppet] - 10https://gerrit.wikimedia.org/r/735025 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [14:59:19] 10SRE, 10SRE-Access-Requests: Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10CDanis) 05Resolved→03Open Hi @EJoseph, looks like you re-used the same SSH key between both WMCS and production. 
Can you please generate a new key solely for production use? Thanks! [15:00:42] (03CR) 10Giuseppe Lavagetto: mediawiki: add handling of php-fpm logs via rsyslogd (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [15:00:53] Spookreeeno: yep, already done :) [15:01:08] cdanis: woah, nice catch! [15:01:43] sukhe: perfect [15:02:45] sukhe: just happened to notice the cross-validate-accounts email [15:02:57] (03PS3) 10Giuseppe Lavagetto: mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) [15:03:33] (03CR) 10Ssingh: [C: 03+1] haproxy: Allow setting variables [puppet] - 10https://gerrit.wikimedia.org/r/735324 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:03:37] !log volans@cumin2002 START - Cookbook sre.hosts.ipmi-password-reset [15:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:09] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [15:04:29] !log volans@cumin2002 Updating IPMI password on 8 hosts - volans@cumin2002 - T283050 [15:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:35] T283050: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 [15:05:54] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [15:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:13] (03PS2) 10MMandere: hiera::drmrs Add drmrs DC site instances [puppet] - 10https://gerrit.wikimedia.org/r/735399 (https://phabricator.wikimedia.org/T282787) [15:18:18] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: 
Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [15:18:32] (03PS1) 10BBlack: Add temporary junk IP for prometheus6001 [dns] - 10https://gerrit.wikimedia.org/r/735405 [15:20:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10Marostegui) Please coordinate with us before changing the DIMM as this host has MySQL up and running (although it is not serving production it can still get corrupted if powered off without gracefu... [15:22:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:19] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:22:20] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:26] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with error... 
[15:22:57] (03PS2) 10BBlack: Add temporary junk IP for prometheus6001 [dns] - 10https://gerrit.wikimedia.org/r/735405 (https://phabricator.wikimedia.org/T282787) [15:24:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:13] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:24:15] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:21] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with error... [15:25:55] (03CR) 10Ssingh: [C: 03+1] cache::haproxy: Fix missing_xwd ACL [puppet] - 10https://gerrit.wikimedia.org/r/735325 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:26:38] (03CR) 10BBlack: [C: 03+2] Add temporary junk IP for prometheus6001 [dns] - 10https://gerrit.wikimedia.org/r/735405 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [15:30:29] (03CR) 10BBlack: [C: 03+1] "We're getting closer to a live DC with every step!" 
[puppet] - 10https://gerrit.wikimedia.org/r/735399 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [15:30:35] (03CR) 10Ssingh: [C: 03+1] "(per the mailing list thread ;)" [puppet] - 10https://gerrit.wikimedia.org/r/735386 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:32:09] (03CR) 10MMandere: [C: 03+2] hiera::drmrs Add drmrs DC site instances [puppet] - 10https://gerrit.wikimedia.org/r/735399 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [15:33:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:45] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:38:48] (03PS1) 10Urbanecm: foundationwiki: Revoke editsitejson and editinterface from users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735408 (https://phabricator.wikimedia.org/T205347) [15:38:52] jouncebot: nowandnext [15:38:53] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [15:38:53] In 0 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1600) [15:39:06] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Revoke editsitejson and editinterface from users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735408 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [15:40:32] (03Merged) 10jenkins-bot: foundationwiki: Revoke editsitejson and editinterface from users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735408 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [15:40:36] (03PS6) 10Elukey: kserve: add network policies 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/732939 (https://phabricator.wikimedia.org/T289834) [15:41:39] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:43:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bcc910fffbad3a776aa2465740ac42e9e8ffa26c: foundationwiki: Revoke editsitejson and editinterface from users (T205347) (duration: 01m 04s) [15:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:14] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [15:44:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:44:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:57] (03PS2) 10Ayounsi: Remove includes for eqsin-ulsfo GRE tunnel prefix [dns] - 10https://gerrit.wikimedia.org/r/735293 (https://phabricator.wikimedia.org/T273308) [15:45:16] (03CR) 10Hnowlan: "Approach lgtm! The diffs check out for the prod config with this changed, only labels are changed. 
One query outstanding but seems fine to" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [15:45:56] (03CR) 10Ayounsi: [C: 03+2] Remove includes for eqsin-ulsfo GRE tunnel prefix [dns] - 10https://gerrit.wikimedia.org/r/735293 (https://phabricator.wikimedia.org/T273308) (owner: 10Ayounsi) [15:47:05] (03Abandoned) 10Hnowlan: maps1009: remove temporary overrides [puppet] - 10https://gerrit.wikimedia.org/r/715721 (owner: 10Hnowlan) [15:47:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:19] (03PS1) 10MMandere: realm: Add drmrs DC site IP regex [puppet] - 10https://gerrit.wikimedia.org/r/735409 (https://phabricator.wikimedia.org/T282787) [15:59:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [15:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:12] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye completed: - cloudc... [16:00:05] jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1600). [16:00:05] urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:35] o/ [16:01:27] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) [16:01:46] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) 05Open→03Resolved finally done! [16:01:59] urbanecm: fyi [16:02:00] >There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form. [16:02:06] every time I log in to foundation.wm.o [16:02:20] (03CR) 10BBlack: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/735409 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:02:35] Reedy: can you try it in an incognito window (and/or clear cookies)? [16:03:28] seems to help [16:03:30] silly things [16:03:37] * Reedy connects it to SUL [16:04:28] glad to hear it works [16:04:33] That error is nearly always cookies [16:05:29] jbond: rzl: anyone to do the puppet window please? :-) [16:05:33] except when it's tokens [16:05:38] 10SRE, 10SRE-Access-Requests: Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10EJoseph) >>! In T294379#7465732, @CDanis wrote: > Hi @EJoseph, looks like you re-used the same SSH key between both WMCS and production. Can you please generate a new key solely for product... 
[16:05:45] (03CR) 10MMandere: [C: 03+2] realm: Add drmrs DC site IP regex [puppet] - 10https://gerrit.wikimedia.org/r/735409 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:07:41] (03CR) 10Elukey: api-gateway: generalize pathing_map (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:07:58] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:08:03] (03CR) 10Herron: [C: 03+2] kafka-jumbo: permit centrallog2002 via ferm [puppet] - 10https://gerrit.wikimedia.org/r/732711 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [16:08:34] (03PS1) 10Ayounsi: Bird: peer with router IP (gateway) if nothing explicitely set [puppet] - 10https://gerrit.wikimedia.org/r/735410 [16:08:38] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10dcaro) \o/ [16:09:49] (03PS6) 10Elukey: api-gateway: generalize pathing_map [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) [16:12:27] (03PS1) 10Elukey: api-gateway: move pathing_map config to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) [16:12:38] hnowlan: o/ added also --^ [16:13:35] (seems easier to modify in this way) [16:17:50] !log Attach BStorm (WMF)@foundationwiki to SUL (T205347) [16:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:58] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [16:19:05] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) One thing I think most of us missed is 
that MediaWiki AFAICT isn't actually uploading any data, it's just sending a X-Copy-From to Swift. Looking at http... [16:20:07] (03PS1) 10MMandere: ntp: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/735412 (https://phabricator.wikimedia.org/T282787) [16:21:02] (03CR) 10BBlack: [C: 03+1] ntp: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/735412 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:21:48] (03PS2) 10Urbanecm: foundationwiki: Enable Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735396 (https://phabricator.wikimedia.org/T205349) [16:21:51] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Enable Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735396 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [16:21:57] (03CR) 10Elukey: [C: 04-1] "Weird this doesn't work, it is not a no-op" [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:22:41] (03CR) 10MMandere: [C: 03+2] ntp: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/735412 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:22:48] (03Merged) 10jenkins-bot: foundationwiki: Enable Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735396 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [16:24:10] !log foundationwiki: Create DB tables for translate extension (T205349) [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:17] T205349: Enable Translate extension on Governance wiki - https://phabricator.wikimedia.org/T205349 [16:27:05] trying to enable Translate, and this happens... https://www.irccloud.com/pastebin/oVdfyU8q/ [16:27:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:54] wikiadmin@10.64.48.58(foundationwiki)> explain page; shows page_lang... [16:28:01] (03CR) 10Hnowlan: [C: 03+2] "I'm happy to deploy this to staging tomorrow and then prod on Monday" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:28:13] (03PS1) 10Herron: base_packages: install netcat-openbsd by default [puppet] - 10https://gerrit.wikimedia.org/r/735413 [16:29:12] hmm, cannot reproduce anymore [16:29:13] syncing [16:29:56] 10SRE, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10odimitrijevic) [16:30:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:58] (03CR) 10Herron: "I was testing ipv6 firewall rules today and realized that the netcat-traditional version deployed on most hosts didn't support ipv6, but o" [puppet] - 10https://gerrit.wikimedia.org/r/735413 (owner: 10Herron) [16:31:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 78783a7624ddf4e5bfac1e255f466d3c3e36016d: foundationwiki: Enable Translate extension (T205349) (duration: 01m 25s) [16:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:11] T205349: Enable Translate extension on Governance wiki - https://phabricator.wikimedia.org/T205349 [16:32:38] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:32:43] (03Merged) 10jenkins-bot: api-gateway: generalize pathing_map [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:33:19] looks i just broke https://foundation.wikimedia.org/wiki/Special:RecentChanges somehow... [16:33:26] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:33:58] (03CR) 10Elukey: "mmm It is maybe due to the parent change to the chart, I'll re-check once it gets merged (and the chart version gets updated)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:33:58] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:34:36] (03PS1) 10RhinosF1: REST: Avoid making 'wpaccuracy' required in API requests [extensions/FlaggedRevs] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735330 (https://phabricator.wikimedia.org/T294544) [16:34:40] (03PS2) 10D3r1ck01: REST: Avoid making 'wpaccuracy' required in API requests [extensions/FlaggedRevs] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735330 (https://phabricator.wikimedia.org/T294544) (owner: 10RhinosF1) [16:34:48] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:34:59] (03CR) 10Elukey: api-gateway: generalize pathing_map (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:35:34] xSavitar: can you restore the cherry picked from commit line [16:35:40] (03CR) 10Hnowlan: api-gateway: move pathing_map config to helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:37:14] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:38:26] Spookreeeno: We created a backport at the same time? [16:38:34] wow, sorry! [16:39:50] (03CR) 10Elukey: api-gateway: move pathing_map config to helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:40:23] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:40:24] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:41:07] Spookreeeno: When you say restore, what should I do. Not sure I understand please. 
[16:42:59] (03PS1) 10Urbanecm: Revert "foundationwiki: Enable Translate extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735415 (https://phabricator.wikimedia.org/T205349) [16:43:01] (03CR) 10Urbanecm: [C: 03+2] Revert "foundationwiki: Enable Translate extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735415 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [16:43:57] (03Merged) 10jenkins-bot: Revert "foundationwiki: Enable Translate extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735415 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [16:44:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: REVERT: 78783a7624ddf4e5bfac1e255f466d3c3e36016d: foundationwiki: Enable Translate extension (T205349) (duration: 01m 04s) [16:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:32] T205349: Enable Translate extension on Governance wiki - https://phabricator.wikimedia.org/T205349 [16:49:13] xSavitar: look at the diff between 1 & 2 [16:50:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:46] Spookreeeno: oops, sorry about that. Not sure why it's refusing to update. I've fetched down PS1 and I can't update the PS [16:55:15] Can you just abandon this cherrypick and create a new one Spookreeeno please [16:55:55] When I saw that the main patch was merged, I clicked on cherry pick and it must have uploaded a different patch set since you seem to have done it before me. 
[16:56:29] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [17:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1700). [17:01:17] (03CR) 10Elukey: "Hugh: I think that CI diff is not working correctly with this change, since I am removing something from a chart (not yet publicized/avail" [deployment-charts] - 10https://gerrit.wikimedia.org/r/735411 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [17:08:19] (03PS3) 10Ahmon Dancy: First rev of WMF docker-resource-monitor/docker-gc images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) [17:10:13] (03Abandoned) 10Elukey: hemlfile.d: add the inference service to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/730965 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [17:10:34] (03CR) 10Ahmon Dancy: First rev of WMF docker-resource-monitor/docker-gc images (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) (owner: 10Ahmon Dancy) [17:16:39] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) Replacement part shipped. RMA below Your replacement part associated with RMA R200378121 Item # 100 has been successfully shipped. Details of which are provided below. 
[17:16:54] (03PS1) 10Cwhite: upgrade ecs to 1.11.0 [software/ecs] - 10https://gerrit.wikimedia.org/r/735417 (https://phabricator.wikimedia.org/T294581) [17:18:33] (03PS3) 10D3r1ck01: REST: Avoid making 'wpaccuracy' required in API requests [extensions/FlaggedRevs] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735330 (https://phabricator.wikimedia.org/T294544) (owner: 10RhinosF1) [17:19:00] Spookreeeno: Did it via the UI, we're fine now. Thanks! [17:36:16] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:43:22] (03PS4) 10Ahmon Dancy: First rev of WMF docker-gc image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) [17:44:01] (03CR) 10Ahmon Dancy: First rev of WMF docker-gc image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) (owner: 10Ahmon Dancy) [17:46:07] (03PS1) 10Volans: ipmi: allow to hide parts of the command [software/spicerack] - 10https://gerrit.wikimedia.org/r/735421 [17:57:57] (03CR) 10Andrew Bogott: [C: 03+1] start_instance_with_prefix: work around extra stderr message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/731913 (owner: 10David Caro) [18:00:04] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1800). [18:00:04] urbanecm: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:49] already done [18:03:23] (03PS4) 10Herron: rsyslog: switch codfw TLS remote syslog destination to centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/734405 (https://phabricator.wikimedia.org/T292196) [18:15:20] (03CR) 10Herron: [C: 03+2] profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:17:26] (03CR) 10Herron: [C: 03+2] profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:19:11] (03PS9) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [18:20:08] (03PS16) 10Thcipriani: Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [18:22:08] (03CR) 10Daniel Kinzler: REST: Avoid making 'wpaccuracy' required in API requests (031 comment) [extensions/FlaggedRevs] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735330 (https://phabricator.wikimedia.org/T294544) (owner: 10RhinosF1) [18:37:08] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:40:33] 10SRE, 10MediaWiki-Docker: Create and publish arm64 images of wikimedia-stretch and wikimedia-buster - https://phabricator.wikimedia.org/T274140 (10hashar) [18:48:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Platform Engineering, and 2 others: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10Cmjohnson) a:05Papaul→03Jclark-ctr [18:49:12] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10Cmjohnson) @wiki_willy @RobH we're going to need to purchase a replacement DIMM [18:50:33] (03PS2) 10Urbanecm: 
emailuser ratelimit: Use user-global rather than user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732260 (https://phabricator.wikimedia.org/T293866) [18:50:37] (03CR) 10Urbanecm: [C: 03+2] emailuser ratelimit: Use user-global rather than user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732260 (https://phabricator.wikimedia.org/T293866) (owner: 10Urbanecm) [18:50:39] 10SRE, 10SRE-Access-Requests: Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10Gehel) 05Open→03Resolved Tested and working ! Let's close. [18:51:23] (03Merged) 10jenkins-bot: emailuser ratelimit: Use user-global rather than user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732260 (https://phabricator.wikimedia.org/T293866) (owner: 10Urbanecm) [18:52:25] (03PS1) 10Ottomata: re-enable hdfs-cleaner-gobblin [puppet] - 10https://gerrit.wikimedia.org/r/735429 (https://phabricator.wikimedia.org/T287084) [18:53:07] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4e0200e3f5ad9c7514ddbeda6f7ee4f8b5ed2ec7: emailuser ratelimit: Use user-global rather than user (T293866) (duration: 01m 04s) [18:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:14] (03CR) 10Ottomata: [V: 03+2 C: 03+2] re-enable hdfs-cleaner-gobblin [puppet] - 10https://gerrit.wikimedia.org/r/735429 (https://phabricator.wikimedia.org/T287084) (owner: 10Ottomata) [18:53:15] T293866: emailuser ratelimit should make use of user-global - https://phabricator.wikimedia.org/T293866 [18:54:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:24] (03PS1) 10Ottomata: hdfs_cleaner - Use $ensure_timer param for hdfs-cleaner-gobblin [puppet] - 10https://gerrit.wikimedia.org/r/735430 [18:54:25] urbanecm: hi, fyi, we are getting mail from a cron at urbanecm@stat1005 because it gets sent to root somehow [18:54:38] mutante: sorry! Lemme fix it [18:54:39] (03PS2) 10Ottomata: hdfs_cleaner - Use $ensure_timer param for hdfs-cleaner-gobblin [puppet] - 10https://gerrit.wikimedia.org/r/735430 [18:54:43] no problem at all. thx [18:55:24] redirected stdout/stderr to a file instead [18:55:29] (03CR) 10Ottomata: [V: 03+2 C: 03+2] hdfs_cleaner - Use $ensure_timer param for hdfs-cleaner-gobblin [puppet] - 10https://gerrit.wikimedia.org/r/735430 (owner: 10Ottomata) [18:55:29] +1 :) [18:57:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] twentyafterfour and hashar: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T1900) [19:00:56] (03CR) 10D3r1ck01: REST: Avoid making 'wpaccuracy' required in API requests (031 comment) [extensions/FlaggedRevs] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735330 (https://phabricator.wikimedia.org/T294544) (owner: 10RhinosF1) [19:02:02] (03PS1) 10Urbanecm: foundationwiki: Use shared OAuth tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735432 (https://phabricator.wikimedia.org/T205347) [19:03:17] twentyafterfour: hashar: hey, T294544 has a patch in master, but not yet backported, and the other blocker, T294559, has a comment by MatmaRex saying this is almost certainly caused by a certain patch [19:03:18] T294544: FlaggedRevs does not work in german wiktionary - https://phabricator.wikimedia.org/T294544 [19:03:18] T294559: Editing page ending with :number or going to such page via a namespace alias results in redirect to address with :number used as port - https://phabricator.wikimedia.org/T294559 [19:03:33] should i backport those two things? [19:03:42] or did you decide to postpone train for next week anyway? [19:04:07] urbanecm: no we will go ahead if blockers can be resolved [19:04:17] so, I'll try to do so then :)) [19:04:22] (03CR) 10Urbanecm: [C: 03+2] REST: Avoid making 'wpaccuracy' required in API requests [extensions/FlaggedRevs] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735330 (https://phabricator.wikimedia.org/T294544) (owner: 10RhinosF1) [19:04:47] If you want me to backport I can do so [19:04:58] i want to do a config patch anyway [19:05:03] so I don't mind doing backports too :) [19:05:10] ok thanks! 
[19:05:14] np [19:05:24] (03PS1) 10Urbanecm: Revert "wfParseUrl: rely on parse_url for proto-relative urls" [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735332 (https://phabricator.wikimedia.org/T294559) [19:05:28] (03CR) 10Urbanecm: [C: 03+2] Revert "wfParseUrl: rely on parse_url for proto-relative urls" [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735332 (https://phabricator.wikimedia.org/T294559) (owner: 10Urbanecm) [19:06:54] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Use shared OAuth tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735432 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [19:07:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:11] (03Merged) 10jenkins-bot: foundationwiki: Use shared OAuth tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735432 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [19:09:10] (03Merged) 10jenkins-bot: REST: Avoid making 'wpaccuracy' required in API requests [extensions/FlaggedRevs] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735330 (https://phabricator.wikimedia.org/T294544) (owner: 10RhinosF1) [19:10:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:37] that was quick [19:12:07] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 60413dcece4a9e5ec125be49315dc22ec3b85cc7: foundationwiki: Use shared OAuth tables (T205347) (duration: 01m 04s) [19:12:14] (03PS1) 10Legoktm: Hack: Temporarily log headers in MultiHttpClient [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735334 [19:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:17] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [19:12:24] (03PS1) 10Legoktm: Hack: Temporarily log headers in MultiHttpClient [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/735335 [19:12:33] (03PS7) 10Dzahn: wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 [19:13:11] * legoktm will wait in line [19:14:29] legoktm: if you want to deploy the hacks you just uploaded, feel free to +2, and they can ride together with my own core backport :) [19:14:43] (03CR) 10Legoktm: [C: 03+2] Hack: Temporarily log headers in MultiHttpClient [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/735335 (owner: 10Legoktm) [19:14:44] ty [19:14:51] (03CR) 10Legoktm: [C: 03+2] Hack: Temporarily log headers in MultiHttpClient [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735334 (owner: 10Legoktm) [19:14:59] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.6/extensions/FlaggedRevs/: 2cd2a4e62af1b66ad29adf27d761129e8fea388a: REST: Avoid making wpaccuracy required in API requests (T294544) (duration: 01m 03s) [19:15:00] np. I'll ping you when clear. 
[19:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:05] T294544: FlaggedRevs does not work in german wiktionary - https://phabricator.wikimedia.org/T294544 [19:15:07] I just broke ganeti4003 [19:15:14] poor ganeti4003 :( [19:15:33] i pressed its power button instead of the one on the server i racked above it [19:15:37] totally stupid [19:15:40] (PS10) Herron: add logstash gelf relay and enable on one host [puppet] - https://gerrit.wikimedia.org/r/721364 [19:15:43] it's powering back up now [19:15:48] PROBLEM - Host bast4003 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:10] however i'm not sure what else to do with it [19:16:12] PROBLEM - Host durum4002 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:16] PROBLEM - Host ganeti4003 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:16] PROBLEM - Host ncredir4002 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:32] (CR) jerkins-bot: [V: -1] add logstash gelf relay and enable on one host [puppet] - https://gerrit.wikimedia.org/r/721364 (owner: Herron) [19:16:43] https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node [19:16:55] pinged in traffic as well [19:16:55] > Just powercycle the host. If that works, it's probably the faster way out. Most services should anyway be set up highly available and if we got one that is not we either should set it that way or not care too much when it fails. If this works, you are done, if not keep on reading. [19:17:02] PROBLEM - Host netflow4001 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:16] so I guess we just see how services react when it comes back up? [19:17:19] i'm so stupid ;_; [19:17:39] we can failover the VMs but I think all of those should be HA [19:17:41] as my finger hit it i thought 'no dummy wrong server' but was too late! 
[19:18:02] RECOVERY - Host ganeti4003 is UP: PING OK - Packet loss = 0%, RTA = 68.47 ms [19:18:05] i am the chaos monkey [19:18:10] :D [19:18:16] (03CR) 10Jbond: [C: 03+1] "apart from ipv6 support this also improves sec a bit (arguably by obscurity) as there is no -e flag making it a bit more difficult to set " [puppet] - 10https://gerrit.wikimedia.org/r/735413 (owner: 10Herron) [19:18:40] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:19:17] who needs a resting heart rate anyhow [19:19:26] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:34] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:19:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ncredir site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:52] are the BGP/BFD issues related to the ganeti server? 
[19:19:56] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:20:09] shouldn't be [19:20:21] but i am in these racks so perhaps i bumped something [19:20:27] (PS1) AOkoth: gitlab: enable backup restore timer [puppet] - https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) [19:20:37] i can get back on bast4003! [19:20:43] that's good, right [19:20:46] (PS11) Herron: add logstash gelf relay and enable on one host [puppet] - https://gerrit.wikimedia.org/r/721364 [19:20:57] hrmm [19:21:07] RECOVERY - Host bast4003 is UP: PING OK - Packet loss = 0%, RTA = 68.38 ms [19:21:13] i have no idea what's up with the cr[34] bgp things [19:21:17] topranks: you about? [19:21:23] RECOVERY - Host durum4002 is UP: PING OK - Packet loss = 0%, RTA = 74.71 ms [19:21:31] I was just checking on durum [19:21:34] i powercycled a ganeti host by mistake and now we happen to have bgp alarms [19:21:34] looks like the VMs are coming back [19:21:37] that's another VM that was running on that [19:21:38] yea [19:21:39] RECOVERY - Host ncredir4002 is UP: PING OK - Packet loss = 0%, RTA = 68.29 ms [19:21:42] good [19:21:43] PROBLEM - Check systemd state on ncredir4002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:49] (CR) Jbond: [C: +1] "previous comment standing this is not where you need to change things. 
you need to add netcat-openbsd to modules/base/manifests/standard_" [puppet] - 10https://gerrit.wikimedia.org/r/735413 (owner: 10Herron) [19:21:49] RECOVERY - Host netflow4001 is UP: PING OK - Packet loss = 0%, RTA = 68.45 ms [19:21:51] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [19:22:00] ganeti is constructed to survive me, thats good [19:22:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:22:13] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [19:22:23] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:22:23] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [19:22:27] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [19:22:33] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [19:22:40] * legoktm is looking at ncredir4002 [19:22:43] RECOVERY - Check systemd state on ncredir4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:45] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The following units failed: 
ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:47] bblack: ncredir had an unexpected reboot [19:22:49] trying to start nginx [19:22:56] BGP/BFD should just be anycast things in ganeti, right? [19:22:58] because the VM host went down [19:23:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:15] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4002 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 200204 seconds left:Certificate *.wikipedia.bg valid until 2021-12-25 12:01:30 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:23:21] cool [19:23:22] oh cool [19:23:24] !log manually restarted nginx on ncredir4002 after accidental ganeti reboot [19:23:25] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4002 is OK: SSL OK - OCSP staple validity for wikimedia.is has 254194 seconds left:Certificate wikimedia.is valid until 2021-12-23 17:01:35 +0000 (expires in 55 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:29] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir4002 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 272190 seconds left:Certificate *.wikispecies.net valid until 2021-12-16 10:01:04 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:23:30] well... i dont wanna do that again [19:23:32] sorry folks! 
[19:23:35] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir4002 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 178585 seconds left:Certificate wikipedia.fi valid until 2022-01-14 08:02:00 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:23:53] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4002 is OK: SSL OK - OCSP staple validity for wikipedia.com has 351366 seconds left:Certificate wikipedia.com valid until 2021-12-17 08:01:16 +0000 (expires in 49 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:23:53] rescheduling a bunch of alerts [19:24:12] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/compiler1001/31981/" [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [19:24:33] netflow4001 is left with something, looking [19:24:46] ifup@ens13.service is down [19:24:49] (03PS2) 10Herron: base_packages: install netcat-openbsd by default [puppet] - 10https://gerrit.wikimedia.org/r/735413 [19:24:54] cr4-ulsfo is more concerning [19:25:38] (03CR) 10Herron: base_packages: install netcat-openbsd by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735413 (owner: 10Herron) [19:25:47] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10wiki_willy) [19:25:55] legoktm: you know, I have a guess what this is.. that VM was modified in the past in some way, like adding a second hard disk [19:26:09] and then what happens is the devices get renumbered at next reboot [19:26:13] happened before [19:26:26] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10wiki_willy) Procurement request created for @robh to order via T294590. 
Thanks, Willy [19:26:33] unless it just clears with reset-failed [19:26:50] yo [19:27:07] XioNoX: BGP CRITICAL - AS64605/IPv4: Connect - Anycast [19:27:16] cr4-ulsfo [19:27:29] and cr3-ulsfo Crit: Down: 1 [19:27:45] looking [19:27:51] there was an event in ulsfo rack that took down ganeti4003 [19:27:54] but that part is back [19:28:05] and the strange part is that about the same time we had new cr alerts there [19:28:20] event named rob's index finger [19:28:21] durum4002.ulsfo.wmnet [19:28:29] it rebooted but is back [19:28:29] sukhe: ^ [19:28:32] all VMs there rebooted [19:28:37] because ganeti4003 did [19:28:51] or not all, all that were running on ganeti4003 at the time [19:29:15] !log [netflow4001:~] $ sudo systemctl reset-failed [19:29:17] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:25] I thought durum was just the wikidough check service? [19:29:37] yes, it is [19:29:39] legoktm: yeah it is, so it's no big deal I think [19:29:56] durum should be back [19:30:06] all VMs appear to be ok [19:30:07] now [19:30:14] but not the router alerts [19:30:20] maybe it had a local config? 
[19:31:05] durum4002.ulsfo.wmnet [19:31:10] er 10.128.0.7 Up 0.900 0.300 3 [19:31:14] so it should recover [19:31:55] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:05] bast4003 was down but is back too [19:32:06] heh [19:32:39] if only my brain had stopped me half a second sooner ;P [19:32:53] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:32:53] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:59] ^ rescheduled in Icinga [19:33:08] woot [19:33:23] stepping away, ping me if needed [19:33:24] so.. nothing in Icinga alerts and starts with a '4' anymore. over [19:33:27] thanks XioNoX [19:33:45] (Merged) jenkins-bot: Revert "wfParseUrl: rely on parse_url for proto-relative urls" [core] (wmf/1.38.0-wmf.6) - https://gerrit.wikimedia.org/r/735332 (https://phabricator.wikimedia.org/T294559) (owner: Urbanecm) [19:34:00] just in case: assuming it's fine to continue with MW deploys? [19:34:04] (PS12) Herron: add logstash gelf relay to elastic1049 [puppet] - https://gerrit.wikimedia.org/r/721364 (https://phabricator.wikimedia.org/T288620) [19:34:15] urbanecm: yea, things look back to normal [19:34:21] okay, continuing then. [19:36:16] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.6/includes/GlobalFunctions.php: b517ebd29396c23f66806cb0dca1d1b330c6e5be: Revert "wfParseUrl: rely on parse_url for proto-relative urls" (T294559) (duration: 01m 03s) [19:36:19] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:23] T294559: Editing page ending with :number or going to such page via a namespace alias results in redirect to address with :number used as port - https://phabricator.wikimedia.org/T294559 [19:37:20] (03Merged) 10jenkins-bot: Hack: Temporarily log headers in MultiHttpClient [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/735335 (owner: 10Legoktm) [19:37:29] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:37:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:48] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:08] legoktm: go ahead [19:38:14] (with deployment of your hack) [19:38:20] (03PS2) 10AOkoth: gitlab: enable backup restore timer [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) [19:38:51] (03CR) 10jerkins-bot: [V: 04-1] gitlab: enable backup restore timer [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [19:40:30] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.5/includes/libs/http/MultiHttpClient.php: Hack: temporarily log headers in MultiHttpClient (duration: 01m 02s) [19:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:31] (03Merged) 10jenkins-bot: Hack: Temporarily log headers in MultiHttpClient [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/735334 (owner: 10Legoktm) [19:41:39] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:15] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.6/includes/libs/http/MultiHttpClient.php: Hack: temporarily log headers in MultiHttpClient (duration: 01m 02s) [19:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:09] * legoktm is done [19:47:27] twentyafterfour: train should be unblocked again :) [19:47:35] can you try to promote the wikis again please? [19:48:11] urbanecm: sure, gonna start with group1 since it's rolled back [19:48:18] ack :) [19:48:35] (03PS3) 10AOkoth: gitlab: enable backup restore timer [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) [19:48:46] (03PS1) 1020after4: group1 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735436 [19:48:49] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735436 (owner: 1020after4) [19:49:05] (03CR) 10jerkins-bot: [V: 04-1] gitlab: enable backup restore timer [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [19:49:37] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:49:48] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735436 (owner: 1020after4) [19:50:27] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:03] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.6 refs T293947 [19:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:09] T293947: 1.38.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T293947 [19:51:33] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:52:06] !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.6 refs T293947 (duration: 01m 02s) [19:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:00] ok logstash looks clear [19:53:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:53] I've confirmed that T294559 isn't happening now [19:53:53] T294559: Editing page ending with :number or going to such page via a namespace alias results in redirect to address with :number used as port - https://phabricator.wikimedia.org/T294559 [19:53:55] (03CR) 10Jbond: [C: 03+1] base_packages: install netcat-openbsd by default [puppet] - 10https://gerrit.wikimedia.org/r/735413 (owner: 10Herron) [19:55:40] (03PS1) 10Dzahn: gitlab: ensure restore timer is enabled only on passive host [puppet] - 10https://gerrit.wikimedia.org/r/735437 [19:57:02] (03PS4) 10AOkoth: gitlab: enable backup restore timer [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) [19:57:23] (03PS2) 10Dzahn: gitlab: ensure restore timer is enabled only on passive host [puppet] - 10https://gerrit.wikimedia.org/r/735437 [19:57:32] (03PS1) 10Urbanecm: foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735438 (https://phabricator.wikimedia.org/T205347) [19:57:41] And I've got confirmation on the other blocker as well [19:57:47] sounds great! [19:58:00] ok I'm going to attempt to roll forward to group2 wikis if there are no objections [19:58:16] sounds good to me twentyafterfour [19:59:15] (03CR) 10Dzahn: "you found a bug in code I added, basically. 
Let me fix that here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/735437 then you can" [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [20:01:19] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:01:19] (03PS1) 1020after4: group2 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735439 [20:01:21] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735439 (owner: 1020after4) [20:02:03] (03Merged) 10jenkins-bot: group2 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735439 (owner: 1020after4) [20:03:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:19] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:03:20] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.38.0-wmf.6 refs T293947 [20:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:26] T293947: 1.38.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T293947 [20:04:03] ok nothing seems to be exploding [20:05:21] that's a good sign i guess [20:05:35] twentyafterfour: mind if i squeeze https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/735438/ in? 
[20:06:02] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) [20:06:12] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) a:03RobH [20:06:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:55] urbanecm: go for it [20:09:59] thanks [20:10:04] (03PS2) 10Urbanecm: foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735438 (https://phabricator.wikimedia.org/T205347) [20:10:08] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735438 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [20:10:55] (03Merged) 10jenkins-bot: foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735438 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [20:13:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c23e7ab0f88a64b8f656e06949518fb816b2dd56: foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily (T205347) (duration: 00m 55s) [20:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:11] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [20:13:18] * urbanecm done [20:15:03] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:15:30] (03CR) 10AOkoth: [C: 03+1] gitlab: ensure restore timer is enabled only on 
passive host [puppet] - 10https://gerrit.wikimedia.org/r/735437 (owner: 10Dzahn) [20:16:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:01] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:19:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:40] (03PS1) 10Urbanecm: Make foundationwiki a standard CentralAuth wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735443 (https://phabricator.wikimedia.org/T205347) [20:19:47] (03CR) 10Urbanecm: [C: 04-2] "DNM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735443 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [20:22:27] (03CR) 10Dzahn: [C: 03+2] gitlab: ensure restore timer is enabled only on passive host [puppet] - 10https://gerrit.wikimedia.org/r/735437 (owner: 10Dzahn) [20:24:06] (03CR) 10Dzahn: "noop on gitlab1001, Service[backup-restore.timer]/ensure: ensure changed 'stopped' to 'running' on gitlab2001" [puppet] - 10https://gerrit.wikimedia.org/r/735437 (owner: 10Dzahn) [20:24:17] arnoldokoth: gitlab2001: Service[backup-restore.timer]/ensure: ensure changed 'stopped' to 'running' [20:24:50] arnoldokoth: over to you, if anything goes wrong just disable puppet agent [20:24:59] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) [20:28:02] (03CR) 10Dzahn: [C: 03+2] wikistats: pass php_version parameter to web class to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/733092 (owner: 
10Dzahn) [20:30:13] !log gitlab2001 - re-enabled gitlab-restore-from-backup service [20:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:33:24] mutante: looks good now! [20:33:54] arnoldokoth: :) great! [20:34:22] arnoldokoth: congrats, sounds like a project resolved [20:34:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:57] \o/ [20:36:34] arnoldokoth: !log it and mention the phab T number in the log line :) [20:37:53] !log ensured gitlab restore timer is running only on passive server and re-enabled it - https://gerrit.wikimedia.org/r/c/operations/puppet/+/735437 T274463 [20:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:00] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [20:43:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Jclark-ctr) verified ports few where off adjusted ports to match netbox [20:46:17] (03PS1) 10Dzahn: wikistats: drop unneeded wikistats_host parameter [puppet] - 10https://gerrit.wikimedia.org/r/735449 [20:48:01] (03CR) 10Dzahn: [C: 03+2] wikistats: drop unneeded wikistats_host parameter [puppet] - 10https://gerrit.wikimedia.org/r/735449 (owner: 10Dzahn) [20:48:17] (03CR) 10Dzahn: [C: 03+2] "cloud-only and fixing broken puppet" [puppet] - 10https://gerrit.wikimedia.org/r/735449 (owner: 10Dzahn) [20:51:14] (03CR) 10Dzahn: "well.. 
now we have new errors on bullseye that tell use what is next and the puppet agent run finishes.. which is good.. but also a new is" [puppet] - 10https://gerrit.wikimedia.org/r/735449 (owner: 10Dzahn) [20:51:32] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) Scratch that, the requests that are slow are not using X-Copy-From: ` 2021-10-28 20:45:41 [fc3a0714-35c2-4cbc-8999-ad4c24bc17b4] mw1439 testwiki 1.38.0-wm... [20:53:23] (03CR) 10Hashar: "Ahmon said I could develop that using mediawiki/tools/train-dev so I will definitely give it a try :)" [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/723992 (owner: 10Hashar) [20:54:09] 10ops-drmrs, 10DC-Ops: (Need By: TBD) setup/config PDU in drmrs ( ps1-b12 and ps1-b13) - https://phabricator.wikimedia.org/T294597 (10Papaul) [20:54:24] 10ops-drmrs, 10DC-Ops: (Need By: TBD) setup/config PDU in drmrs ( ps1-b12 and ps1-b13) - https://phabricator.wikimedia.org/T294597 (10Papaul) p:05Triage→03Medium [20:54:55] (03PS1) 10Dzahn: wikistats::httpd: fix PHP package names with duplicate 'phpphp' string [puppet] - 10https://gerrit.wikimedia.org/r/735451 [20:55:30] (03PS1) 10Papaul: Add new PDU for drmrs site [puppet] - 10https://gerrit.wikimedia.org/r/735452 (https://phabricator.wikimedia.org/T294597) [20:56:01] (03CR) 10jerkins-bot: [V: 04-1] Add new PDU for drmrs site [puppet] - 10https://gerrit.wikimedia.org/r/735452 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [20:59:49] (03PS2) 10Papaul: Add new PDU for drmrs site [puppet] - 10https://gerrit.wikimedia.org/r/735452 (https://phabricator.wikimedia.org/T294597) [21:00:47] (03CR) 10Dzahn: [C: 03+2] wikistats::httpd: fix PHP package names with duplicate 'phpphp' string [puppet] - 10https://gerrit.wikimedia.org/r/735451 (owner: 10Dzahn) [21:01:46] (03CR) 10Dzahn: [C: 03+1] "[puppetmaster1001:~] $ host 10.136.128.8" [puppet] - 
10https://gerrit.wikimedia.org/r/735452 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [21:02:19] (03CR) 10Dzahn: [C: 03+1] "but if you merge this you need to be ready to run puppet on alert* (formerly icinga*) and check it for issues" [puppet] - 10https://gerrit.wikimedia.org/r/735452 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [21:04:06] (03CR) 10Papaul: [C: 03+2] Add new PDU for drmrs site [puppet] - 10https://gerrit.wikimedia.org/r/735452 (https://phabricator.wikimedia.org/T294597) (owner: 10Papaul) [21:08:42] (03CR) 10Dzahn: "buster is working fine again and bullseye is WIP and we are getting there" [puppet] - 10https://gerrit.wikimedia.org/r/735451 (owner: 10Dzahn) [21:13:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Dwisehaupt) @Jclark-ctr Thanks, I'll test them out. [21:19:49] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [21:22:48] ^ this is known, we are talking about it [21:22:52] a revert is coming up [21:22:57] (03PS1) 10Papaul: Revert "Add new PDU for drmrs site" [puppet] - 10https://gerrit.wikimedia.org/r/735338 [21:23:34] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add new PDU for drmrs site" [puppet] - 10https://gerrit.wikimedia.org/r/735338 (owner: 10Papaul) [21:24:55] (03PS2) 10Dzahn: Revert "Add new PDU for drmrs site" [puppet] - 10https://gerrit.wikimedia.org/r/735338 (owner: 10Papaul) [21:25:35] (03CR) 10Dzahn: [C: 03+1] "Error: 'mr1-drmrs' is not a valid parent for host 'ps1-b12-drmrs' (file '/etc/nagios/nagios_host.cfg', line 4794)!" [puppet] - 10https://gerrit.wikimedia.org/r/735338 (owner: 10Papaul) [21:26:44] (03CR) 10Dzahn: [C: 03+1] "This is missing the Icinga host of the drmrs router that these PDUs want to see as their "parent" somewhere. 
We are reverting because othe" [puppet] - 10https://gerrit.wikimedia.org/r/735338 (owner: 10Papaul) [21:28:05] (03CR) 10Papaul: [C: 03+2] Revert "Add new PDU for drmrs site" [puppet] - 10https://gerrit.wikimedia.org/r/735338 (owner: 10Papaul) [21:28:31] PDUs in drmrs missing their "parent" router in Icinga. restarting Icinga would have broken it. but we didn't and reverted [21:30:29] 10ops-drmrs, 10DC-Ops: (Need By: TBD) setup/config PDU in drmrs ( ps1-b12 and ps1-b13) - https://phabricator.wikimedia.org/T294597 (10Papaul) While adding the new drmrs PDU's to icinga i get the error below. so revert the changes until this is fix. ` Error: 'mr1-drmrs' is not a valid parent for host 'ps1-b12-... [21:30:45] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [21:32:29] ^ that's it for now [21:38:26] (03CR) 10Dzahn: "already enabled now automatically because it is the passive host in https://gerrit.wikimedia.org/r/c/operations/puppet/+/735437 but let me" [puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [21:39:07] (03CR) 10Dzahn: "otherwise not needed anymore. BUT you can please delete the Hiera key there that disabled it and is now not used anymore, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/735435 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [21:39:43] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:42:05] jouncebot: next [21:42:05] In 1 hour(s) and 17 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T2300) [21:42:23] !log sudo cumin 'C:profile::mediawiki::common' "disable-puppet 'gerrit:734798 - ${USER}'" [21:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:00] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/31966/ and cumin 'C:profile::mediawiki::common' "disable-puppet ...to carefully deploy i" [puppet] - 10https://gerrit.wikimedia.org/r/734798 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [21:46:53] !log restoreccccccvkvhgbvtklgce kkbeuvvuskljihickdbgcunljcr scheduled to run on gitlab2001 (T285867) [21:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:00] T285867: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 [21:47:17] arnoldokoth: looks like the kitty wanted to log as well [21:47:37] oops :( [21:47:46] dont worry, you can just repeat it [21:47:50] there is not really deleting though [21:48:08] well, you _can_ delete it in the wikitech wiki if you care [21:48:14] but it's also tweeted [21:48:18] !log restore script scheduled to run on gitlab2001 (T285867) [21:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:33] cool [21:48:56] deploying a change to mediawiki::common, puppet disabled on all hosts using it for a minute [21:49:04] 366 [21:52:58] 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) setup/config PDU in drmrs ( ps1-b12 and ps1-b13) - 
https://phabricator.wikimedia.org/T294597 (10BBlack) I think what we're missing here is the necessary network hardware entries in `modules/netops/manifests/monitoring.pp` to crea... [21:53:42] !log re-enabled puppet on mw-api-canary [21:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:46] !log re-enabled puppet on mw-app-canary, mwmaint, labweb1002,.. [21:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:58] !log re-enabled puppet on deploy*, parse* and then everything else [22:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:33] !log releases1002 - releases2002, something broke puppet here about 27.5 hours ago. 
lookup() did not find a value for the name 'profile::docker::storage::physical_volumes' [22:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:58] (03CR) 10Dzahn: "Hey John, this doesn't seem to be exactly it but close enough (docker-related, about 27 hours ago) to ask you: on releng1002/releng2002 pu" [puppet] - 10https://gerrit.wikimedia.org/r/735026 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [22:13:41] (03CR) 10Dzahn: P:docker::engine: ensure we include all required classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735026 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [22:15:04] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/31988/" [puppet] - 10https://gerrit.wikimedia.org/r/735075 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [22:16:27] !log mwdebug1001 - letting puppet remove all mediawiki font packages using new Hiera key 'profile::mediawiki::webserver::install_fonts: false' to make sure we really don't need them (T294378) [22:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:34] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [22:17:05] there's something weird going on with jawiki and a "curPage" cookie on Firefox [22:17:06] https://phabricator.wikimedia.org/T294561 [22:18:01] something set the cookie to Ӑ (Ó + \u0090) [22:19:11] bblack: ^ wonder if you have any ideas [22:20:04] legoktm: maybe https://ja.wikipedia.org/wiki/MediaWiki:Gadget-protectionLog.js/core.js#L-30? [22:20:38] is that running on all page views? 
[22:20:43] not sure [22:20:50] tbh i'm not sure if it's running at all [22:20:53] I have the cookie somehow, trying to figure out how to get rid of it [22:20:56] (03CR) 10Dzahn: "carefully re-enabled puppet in steps and it was a noop everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/734798 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [22:21:15] (03CR) 10Dzahn: "works, it removed all the packages on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/735075 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [22:21:37] that does seem suspect [22:22:20] (03CR) 10Dzahn: "Info: Caching catalog for mwdebug1001.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/735075 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [22:22:39] especially if it's using a regex on multibyte characters [22:24:30] legoktm: that page runs for everyone as a gadget [22:24:50] loaded at https://ja.wikipedia.org/wiki/MediaWiki:Gadgets-definition with `protectionIndicator[ResourceLoader|default]|protectionLog.js|protectionLog.css`, see https://ja.wikipedia.org/wiki/MediaWiki:Gadget-protectionLog.js#L-191 [22:25:02] 10SRE, 10serviceops, 10Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Dzahn) I merged the mediawiki::webserver change to allow absenting fonts via Hiera: https://gerrit.wikimedia.org/r/c/operations/puppet/+/734798/10 Then used it o... [22:25:13] can you disable it? [22:25:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, and 2 others: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) @Eevans just for curiosity, any reason we have no restbase hosts in row A ? 
[22:25:58] legoktm: sure, if there's no simple way to fix that gadget itself [22:26:03] the stuck cookie is preventing firefox from even sending the request AFAIS [22:26:10] sounds serious enough [22:27:05] done: https://ja.wikipedia.org/w/index.php?title=MediaWiki:Gadgets-definition&diff=86255655&oldid=85583168 [22:27:26] can you check that did the trick legoktm? [22:27:58] well, anyone who is currently stuck will be stuck until they delete their cookies [22:28:11] :( [22:28:55] I'm not sure how it can be fixed besides a Firefox update [22:29:20] thank you though [22:29:33] np [22:29:45] i guess there's nothing like `unset-cookie` header? [22:30:49] there is, but you have to hit the server to get that far :/ [22:31:14] wait, a cookie prevents the client from even contacting the server? [22:33:40] I'm trying to reproduce [22:35:03] legoktm: i successfully reproduced. Go to https://ja.wikipedia.org/wiki/%D3%90, and then to https://ja.wikipedia.org/wiki/%E7%89%B9%E5%88%A5:%E5%80%8B%E4%BA%BA%E8%A8%AD%E5%AE%9A [22:35:06] (with that gadget enabled, ofc) [22:35:55] and it breaks the page for you? 
[22:36:22] correct [22:36:33] this is what i see https://usercontent.irccloud-cdn.com/file/Iv7MttKc/image.png [22:36:40] deleting the cookie fixes it [22:37:51] this is the cookie info https://usercontent.irccloud-cdn.com/file/WNSHdrny/image.png [22:38:20] hm, yours is different than mine [22:38:47] FF 93.0, ftr [22:40:35] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:31] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1264262 MB (15% inode=76%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [22:43:00] I enabled the gadget and can't get the cookie to come back [22:44:23] oh, there I did it [22:44:26] rip [22:44:33] reproduced too? [22:45:02] yep [22:45:19] in that case, I'm going back to paying 100% attention to my meeting :)) [22:46:22] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Dzahn) >>! In T283076#7454868, @Jelto wrote: > I identified at least two issues which prevent us from having a successful res... [22:48:26] do that :) [22:55:15] ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1264262 MB (15% inode=76%): andrew bogott I will run a find and delete some things https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [23:00:04] brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211028T2300). 
[23:00:04] Juan_90264 and kemayo: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:51] Okay jouncebot [23:05:40] Juan_90264: are you able to test your patch? [23:06:17] Kemayo: are you around for backport? [23:06:31] thcipriani: I can't test it [23:06:36] :( [23:07:32] Juan_90264: I can't test either. Can you reschedule? Another deployer may know. [23:08:41] thcipriani: Yes I can, especially since today is Brennen's training. [23:09:05] TBH, I wouldn't worry about that [23:09:35] Basically, all it controls is whether __INDEX__/__NOINDEX__ will work on the pages [23:10:04] patch look OK, Reedy? [23:10:34] Yeah [23:10:59] There's a few config patches that aren't very easily testable [23:11:00] Reedy: =D [23:11:09] And when it's just changing existing config, it's less of a risk [23:11:16] If it was something completely new, it would be more concerning [23:11:21] okie doke let's do it :D [23:11:44] (03PS2) 10Thcipriani: Add User and User talk to $wgExemptFromUserRobotsControl on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733403 (https://phabricator.wikimedia.org/T288947) (owner: 10A2093064) [23:11:58] (03CR) 10Thcipriani: [C: 03+2] "config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733403 (https://phabricator.wikimedia.org/T288947) (owner: 10A2093064) [23:12:07] Let's start [23:12:44] (03Merged) 10jenkins-bot: Add User and User talk to $wgExemptFromUserRobotsControl on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/733403 (https://phabricator.wikimedia.org/T288947) (owner: 10A2093064) [23:13:05] Perfect [23:13:37] (03PS4) 10Dzahn: wikitech::web: remove font packages from wikitech servers [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) [23:14:23] (03CR) 10BryanDavis: wikireplicas: add Translate extension tables (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: 10AntiCompositeNumber) [23:14:37] (03PS5) 10Dzahn: wikitech::web: remove font packages from wikitech servers [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) [23:17:27] (03PS2) 10AntiCompositeNumber: wikireplicas: add Translate extension tables [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) [23:17:29] (03CR) 10AntiCompositeNumber: wikireplicas: add Translate extension tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: 10AntiCompositeNumber) [23:17:59] thcipriani: ... and the Stashbot? [23:18:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:12] Juan_90264: still staging :) [23:22:41] Okay [23:26:36] Juan_90264: we'll sync your patch now [23:26:58] Great cjming [23:26:59] (03CR) 10Dzahn: "[ 2021-10-28T23:24:47 ] INFO: Compiling host cloudweb2001-dev.wikimedia.org (prod)" [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:27:09] (03CR) 10Tacsipacsi: wikireplicas: add Translate extension tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: 10AntiCompositeNumber) [23:27:44] (03CR) 10Dzahn: "There is one other server where this was already done, today, mwdebug1001. All others still have the fonts as of right now." 
[puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:28:12] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:733403|Add User and User talk to $wgExemptFromUserRobotsControl on zhwiki (T288947)]] (duration: 00m 56s) [23:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:20] T288947: Add User and User talk namespace to $wgExemptFromUserRobotsControl on zhwiki - https://phabricator.wikimedia.org/T288947 [23:28:53] (03PS6) 10Dzahn: wikitech::web: remove font packages from wikitech servers [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) [23:30:37] thcipriani: Code already installed and working, https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php [23:31:09] Juan_90264: great! [23:31:29] kemayo: are you still around? otherwise we're thinking to punt on your patch for now [23:31:55] (03CR) 10BryanDavis: wikireplicas: add Translate extension tables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/735088 (https://phabricator.wikimedia.org/T289952) (owner: 10AntiCompositeNumber) [23:33:18] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) [23:33:30] !log end of UTC late backport & config window [23:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:12] (03PS1) 10RobH: new cp40[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/735461 (https://phabricator.wikimedia.org/T290694) [23:36:17] (03CR) 10BryanDavis: "Wikitech is not in the primary MediaWiki hosting cluster, and I'm not sure if it actually uses thumbor or shellbox as a result. 
Does anybo" [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:38:53] (03CR) 10RobH: [C: 03+2] new cp40[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/735461 (https://phabricator.wikimedia.org/T290694) (owner: 10RobH) [23:43:15] (03CR) 10Dzahn: "This patch is partially a way of asking you guys if you think Wikitech is a special case that either "can go first to ensure nothing needs" [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:43:54] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4033.ulsfo.wmnet with OS buster [23:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster [23:46:46] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [23:46:47] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) [23:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:53] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [23:48:10] (03CR) 10AntiCompositeNumber: wikitech::web: remove font packages from wikitech servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [23:50:06] !log robh@cumin1001 END (ERROR) - Cookbook 
sre.hosts.reimage (exit_code=97) for host cp4034.ulsfo.wmnet with OS buster [23:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:12] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster executed with errors: - cp4034 (*... [23:50:18] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4033.ulsfo.wmnet with OS buster [23:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:23] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster executed with errors: - cp4033 (*... [23:51:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) Not sure why these are failing, but I'm out of mental bandwidth for them today. They are remotely accessible via idrac and will accept script commands.... [23:57:08] Report from #wikimedia 30 minutes ago: [23:57:21] @wikimedia.org emails appear to be misconfigured, and not deliverable. [23:57:32] SMTP error from remote mail server after pipelined end of data: [23:57:43] 550 5.7.1 Your email has been rejected. - gcdp w20si3447117qtk.91 - gsmtp [23:57:53] i tried emailing around four wikimedia.org addresses, the error is the same