[00:01:10] PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 2.593e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [00:01:12] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.593e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [00:01:18] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:14] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 62.03 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [00:26:12] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:20] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:20] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:44] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:58] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on 
elastic2053 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [01:46:02] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1186.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:02:44] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:22] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 65.08 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [02:16:52] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:44] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:54] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:40:08] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:44:00] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) 
https://wikitech.wikimedia.org/wiki/Phabricator [03:02:52] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:54] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:58] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:58:54] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:02:54] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [04:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [04:25:56] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:26:20] RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:32:02] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:35:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:55:08] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 128.1 gt 100 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [05:02:44] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:27:14] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:37:51] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [05:39:28] (03PS7) 10Juan90264: Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) [05:40:34] (03CR) 10jerkins-bot: [V: 04-1] Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [05:49:09] (03PS5) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) [05:51:29] (03CR) 10Juan90264: [C: 03+1] Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: 10Juan90264) [05:54:10] (03PS5) 10Juan90264: Adding square wordmark for ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) [06:02:08] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:02:51] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [06:27:20] 
PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:34] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 21.36 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37 [06:58:36] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2023.codfw.wmnet'... [06:58:43] (03CR) 10Nikerabbit: [C: 03+1] Review access change [software/mailman-templates] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709976 (https://phabricator.wikimedia.org/T288027) (owner: 10Hashar) [07:02:12] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:30] (03PS14) 10Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [07:07:34] (03CR) 10jerkins-bot: [V: 04-1] Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [07:08:57] (03PS1) 10Ladsgroup: Avoid calling delete() with empty arrays in PruneFRIncludeData [extensions/FlaggedRevs] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714151 (https://phabricator.wikimedia.org/T289249) [07:11:26] (03CR) 10Juan90264: [C: 03+1] Add optimised 
square logo and wordmark for Wikimania on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [07:12:11] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [07:18:06] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2023.codfw.wmnet with reason: REIMAGE [07:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:15] (03CR) 10Ladsgroup: [C: 03+2] Avoid calling delete() with empty arrays in PruneFRIncludeData [extensions/FlaggedRevs] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714151 (https://phabricator.wikimedia.org/T289249) (owner: 10Ladsgroup) [07:20:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2023.codfw.wmnet with reason: REIMAGE [07:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:08] (03CR) 10MMandere: varnish: Containerize varnish test environment (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [07:24:04] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:14] (03CR) 10MMandere: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [07:25:08] (03Merged) 10jenkins-bot: Avoid calling delete() with empty arrays in PruneFRIncludeData [extensions/FlaggedRevs] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714151 (https://phabricator.wikimedia.org/T289249) (owner: 10Ladsgroup) [07:28:14] !log ladsgroup@deploy1002 Synchronized 
php-1.37.0-wmf.19/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:714151|Avoid calling delete() with empty arrays in PruneFRIncludeData (T289249)]] (duration: 00m 59s) [07:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:19] T289249: flaggedtemplates table should not keep the whole history of all revisions - https://phabricator.wikimedia.org/T289249 [07:28:54] !log running FlaggedRevs/maintenance/pruneRevData.php on all flaggedrevs wikis [07:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:47] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2023.codfw.wmnet'] ` and were **ALL** successful. [07:39:48] (03PS1) 10Ladsgroup: Set request languages rdf output for wikidata to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714322 (https://phabricator.wikimedia.org/T285795) [07:40:06] (03CR) 10Ladsgroup: [C: 03+2] Set request languages rdf output for wikidata to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714322 (https://phabricator.wikimedia.org/T285795) (owner: 10Ladsgroup) [07:40:54] (03Merged) 10jenkins-bot: Set request languages rdf output for wikidata to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714322 (https://phabricator.wikimedia.org/T285795) (owner: 10Ladsgroup) [07:42:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[07:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:39] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:714322|Set request languages rdf output for wikidata to true (T285795)]] (duration: 00m 57s) [07:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:43] T285795: Limit languages on EntityStub rdf builders - https://phabricator.wikimedia.org/T285795 [07:46:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:05] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: remove reading-web Grafana checks [puppet] - 10https://gerrit.wikimedia.org/r/714067 (https://phabricator.wikimedia.org/T281359) (owner: 10Filippo Giunchedi) [07:51:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [07:55:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:47] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [07:59:04] that's me ^ [07:59:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[07:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:59] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [08:21:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [08:24:19] (03CR) 10Jbond: [C: 03+1] "LGTM but see inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [08:26:19] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [08:46:02] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/711400 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [08:50:48] (03CR) 10Jbond: "lgtm but see inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/712193 (owner: 10Vgutierrez) [08:51:41] (03PS5) 10Jbond: logout cookbook: Quote CN and UID [cookbooks] - 10https://gerrit.wikimedia.org/r/712210 (owner: 10Muehlenhoff) [08:55:47] (03Abandoned) 10Filippo Giunchedi: WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T171482) (owner: 10Filippo Giunchedi) [08:57:22] (03CR) 10Jbond: [C: 03+2] "LGTM will submit" [cookbooks] - 10https://gerrit.wikimedia.org/r/712210 (owner: 10Muehlenhoff) [09:00:48] (03PS2) 10Vgutierrez: wmflib: Adopt Cfssl::Wildcard type [puppet] - 10https://gerrit.wikimedia.org/r/712193 [09:01:33] !log pooling swift in eqiad - T288458 [09:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:38] T288458: Put 
ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 [09:01:49] (03PS3) 10Vgutierrez: wmflib: Adopt Cfssl::Wildcard type [puppet] - 10https://gerrit.wikimedia.org/r/712193 [09:02:30] !log filippo@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad [09:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:44] !log filippo@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad [09:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:53] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:14] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/712193 (owner: 10Vgutierrez) [09:04:41] (03CR) 10Vgutierrez: [C: 03+2] wmflib: Adopt Cfssl::Wildcard type [puppet] - 10https://gerrit.wikimedia.org/r/712193 (owner: 10Vgutierrez) [09:11:49] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [09:13:43] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/714079 (https://phabricator.wikimedia.org/T266479) (owner: 10Jcrespo) [09:14:14] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10fgiunchedi) >>! In T288458#7298092, @Legoktm wrote: >>>! In T288458#7274937, @fgiunchedi wrote: >> Hosts are ready to go now, though with swift traffic fully on codfw I don't think we should reb... [09:14:28] (03CR) 10Jcrespo: "There is a warning, but it is not part of your code, but because of the old jinja version." 
[software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [09:15:52] (03PS3) 10Jcrespo: puppet: Document deprecation of require_packages() on README [puppet] - 10https://gerrit.wikimedia.org/r/714079 (https://phabricator.wikimedia.org/T266479) [09:16:49] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [09:25:33] (03CR) 10Jcrespo: [C: 03+2] puppet: Document deprecation of require_packages() on README [puppet] - 10https://gerrit.wikimedia.org/r/714079 (https://phabricator.wikimedia.org/T266479) (owner: 10Jcrespo) [09:26:03] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [09:30:31] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10jcrespo) What the best way to proceed for media backups? Should I stop operations on both dcs (because one will be rebalancing and the other will be in production)? Please advise. 
[09:33:07] !log filippo@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [09:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:18] !log filippo@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=codfw [09:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:00] jouncebot: now [09:35:00] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [09:35:02] jouncebot: next [09:35:02] In 0 hour(s) and 54 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T1030) [09:35:09] * urbanecm is going to get a security patch out [09:36:21] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10fgiunchedi) >>! In T288458#7300712, @jcrespo wrote: > What the best way to proceed for media backups? Should I stop operations on both dcs (because one will be rebalancing and the other will be... 
[09:37:15] (03PS1) 10Btullis: Add rack locations of the six new datanode servers [puppet] - 10https://gerrit.wikimedia.org/r/714331 (https://phabricator.wikimedia.org/T276239) [09:38:43] (03CR) 10H.krishna123: bernard: Add basic tox config (032 comments) [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [09:41:49] (03PS1) 10Ladsgroup: Add extra sleep option between each batch in pruneRevData.php [extensions/FlaggedRevs] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714152 (https://phabricator.wikimedia.org/T289249) [09:42:14] (03CR) 10Ladsgroup: [C: 03+2] Add extra sleep option between each batch in pruneRevData.php [extensions/FlaggedRevs] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714152 (https://phabricator.wikimedia.org/T289249) (owner: 10Ladsgroup) [09:43:08] (03PS7) 10H.krishna123: bernard: Add basic tox config [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) [09:43:30] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [09:43:57] Amir1: feel free to let the merge to complete, but please do not go to deployment host -- I'm secdeploying sth now. 
[09:44:21] sure, this is going to take a while [09:44:49] thanks [09:46:45] (03Merged) 10jenkins-bot: Add extra sleep option between each batch in pruneRevData.php [extensions/FlaggedRevs] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714152 (https://phabricator.wikimedia.org/T289249) (owner: 10Ladsgroup) [09:47:02] * urbanecm is still deploying [09:48:46] (03CR) 10H.krishna123: "Should be good now 😊" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [09:49:02] (03PS3) 10MMandere: varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) [09:49:34] (03CR) 10jerkins-bot: [V: 04-1] varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [09:50:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:12] !log Deploy security patch for T289408 [09:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:19] Amir1: all yours, thanks for the patience :) [09:55:11] thanks! [09:55:43] !log start re-import OSM planet data into maps1009 eqiad master (T288400, T288897) [09:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:57] T288897: Wikimedia map tiles don't show some natural features (e.g. 
lakes) after zoom 10 - https://phabricator.wikimedia.org/T288897 [09:55:57] T288400: New imposm3 setup provides some broken links for geoshapes when requesting linestrings - https://phabricator.wikimedia.org/T288400 [09:56:10] hnowlan: ^ [09:56:40] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:714152|Add extra sleep option between each batch in pruneRevData.php (T289249)]] (duration: 00m 58s) [09:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:44] T289249: flaggedtemplates table should not keep the whole history of all revisions - https://phabricator.wikimedia.org/T289249 [09:56:59] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 3.582e+04 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [10:01:26] random observation: testcommonswiki isn’t in testwikis.dblist, I wonder if that’s intentional [10:01:53] (Processor usage over 85%) firing: (2) Processor usage over 85% - https://alerts.wikimedia.org [10:02:43] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:29] Lucas_WMDE: I'm wondering what that dblist even is for: https://codesearch.wmcloud.org/operations/?q=testwikis&i=nope&files=&excludeFiles=&repos= returns zero usages [10:03:45] hm, that’s a fair question ^^ [10:04:33] never been used in an IS.php change according to git log -S [10:04:55] mbsantos: ack, nice [10:06:24] (03PS4) 10MMandere: varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) [10:06:26] maybe it's for the new version promotion tooling? 
[10:08:00] good point [10:08:03] https://phabricator.wikimedia.org/diffusion/MREL/browse/master/bin/deploy-promote uses that [10:09:03] Lucas_WMDE: so, to get an authoritative answer, probably speak to releng. On the other hand, "[testcommonswiki is] scheduled to be closed and deleted in December 2019" [10:09:32] well, it evidently hasn’t been closed and deleted [10:09:59] personally I hope it won’t be, as it’s useful 🤷 [10:11:13] right [10:16:53] (Processor usage over 85%) firing: (2) Processor usage over 85% - https://alerts.wikimedia.org [10:17:51] (03PS1) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714334 (https://phabricator.wikimedia.org/T251480) [10:21:26] (03CR) 10Jbond: "Thanks for this see inline for comments,suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/709478 (https://phabricator.wikimedia.org/T287869) (owner: 10Btullis) [10:22:33] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/710932 (owner: 10Kormat) [10:25:49] (03CR) 10Kormat: [C: 03+2] utils: Add support for Hosts: comments to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/710932 (owner: 10Kormat) [10:25:55] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [10:28:07] jbond: <3 [10:28:21] (03Abandoned) 10Hnowlan: maps: standardised the maps2.0 config in eqiad, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702984 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:28:44] (03CR) 10Hnowlan: [C: 03+2] tegola: remove config for decommissioned hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/713899 (https://phabricator.wikimedia.org/T288810) (owner: 
10Hnowlan) [10:29:14] (03Abandoned) 10Hnowlan: maps: disable OSM sync and tilerator in codfw [puppet] - 10https://gerrit.wikimedia.org/r/704296 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T1030). [10:31:44] (03Merged) 10jenkins-bot: tegola: remove config for decommissioned hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/713899 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [10:32:24] urbanecm: I think that dblist is used in scap when they roll out the train to the testwikis [10:32:51] zabe: rather, in deploy-promote (majavah already pointed that out above) [10:35:35] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2025.codfw.wmnet'... [10:35:47] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [10:36:37] (03PS3) 10Hnowlan: maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) [10:36:54] I wasn't yet joined when he wrote that, thanks for pointing it out. 
[10:37:31] (03CR) 10jerkins-bot: [V: 04-1] maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:38:25] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30773/console" [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:40:33] (03CR) 10Jcrespo: [C: 03+1] "Check my question. Manuel will be probably be unavailable today." [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [10:41:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [10:44:40] (03CR) 10Jbond: mediabackups: Switch TLS certificates to PKI rather than puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [10:46:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [10:46:39] jelto: ^^^ this should be good to merge but let me know if i missed something or if yuo need more from me [10:49:43] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10jcrespo) Thanks, that's a super-useful graph I didn't know about. > it is easy enough to switch media backups to read from codfw (?) It is very easy- only one line configuration away. But cros... 
[10:51:33] (03PS4) 10Hnowlan: maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) [10:51:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [10:51:50] (Traffic bill over quota) firing: (3) Traffic bill over quota - https://alerts.wikimedia.org [10:52:15] 10SRE, 10LDAP-Access-Requests: Enable CAS U2F for Majavah - https://phabricator.wikimedia.org/T289477 (10Majavah) [10:54:02] (03CR) 10Jcrespo: mediabackups: Switch TLS certificates to PKI rather than puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [10:55:04] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2025.codfw.wmnet with reason: REIMAGE [10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:45] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jbond) @kormat is this blocking something, currently there is no plan to fix this untill we upgrade to puppet server 6 (for w... [10:56:51] (Traffic bill over quota) firing: (4) Traffic bill over quota - https://alerts.wikimedia.org [10:57:19] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc2025.codfw.wmnet with reason: REIMAGE [10:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T1100). 
[11:00:05] Lucas_WMDE and zabe: A patch you scheduled for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:20] o/ [11:00:42] let’s start with the community consensus revert [11:00:42] o/ [11:00:56] (03PS3) 10Lucas Werkmeister (WMDE): Revert "Enable NewUserMessage on hiwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713860 (https://phabricator.wikimedia.org/T287091) (owner: 10Zabe) [11:01:11] thanks Lucas_WMDE :) [11:01:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Enable NewUserMessage on hiwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713860 (https://phabricator.wikimedia.org/T287091) (owner: 10Zabe) [11:01:40] gives me more time to test my change ^^ [11:01:48] (i.e. test the state before deploying it) [11:02:11] (03Merged) 10jenkins-bot: Revert "Enable NewUserMessage on hiwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713860 (https://phabricator.wikimedia.org/T287091) (owner: 10Zabe) [11:02:33] zabe: change is on mwdebug2001, can you test it? 
[11:03:22] (03PS2) 10Btullis: Remove a reference to druid1001 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/714024 (https://phabricator.wikimedia.org/T255148) [11:03:31] Lucas_WMDE: looks good to me [11:03:39] ok, syncing [11:04:25] (03CR) 10Btullis: [C: 03+2] Remove a reference to druid1001 from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/714024 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [11:04:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:713860|Revert "Enable NewUserMessage on hiwiktionary" (T287091)]] (duration: 00m 57s) [11:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:01] T287091: Enable NewUserMessage on hi.wiktionary - https://phabricator.wikimedia.org/T287091 [11:05:40] (03CR) 10Btullis: [C: 03+2] Add rack locations of the six new datanode servers [puppet] - 10https://gerrit.wikimedia.org/r/714331 (https://phabricator.wikimedia.org/T276239) (owner: 10Btullis) [11:05:41] PROBLEM - Host mc2025 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:57] (03PS2) 10Lucas Werkmeister (WMDE): Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714334 (https://phabricator.wikimedia.org/T251480) [11:06:17] Lucas_WMDE: thanks for deploying :) [11:06:20] np [11:06:39] there’s a minor error spike but it’s on frwiki so probably unrelated [11:07:33] I’m still having the T289246 issue on my work laptop but on my private laptop it seems to work, so I should be able to test my change, yay [11:07:33] T289246: Unable to select backend server in WikimediaDebug extension - https://phabricator.wikimedia.org/T289246 [11:07:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714334 (https://phabricator.wikimedia.org/T251480) (owner: 10Lucas Werkmeister (WMDE)) [11:08:03] 
(03CR) 10Jbond: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [11:08:33] RECOVERY - Host mc2025 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [11:08:40] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2025.codfw.wmnet'] ` and were **ALL** successful. [11:08:44] (03Merged) 10jenkins-bot: Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714334 (https://phabricator.wikimedia.org/T251480) (owner: 10Lucas Werkmeister (WMDE)) [11:09:10] testing on mwdebug2001 [11:09:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:58] heh, and it helps if I test on one of the wikis where the change was deployed, of course [11:10:43] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:10:46] Lucas_WMDE May I ask you to change robh to jynus on topic, when you have the time 🥺 [11:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:04] I don’t think I have permission to change the topic here… [11:11:10] (03PS6) 10Jbond: postgresql::user: split HBA configuration into a different define [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:11:16] jynus: if you don't mind me doing it... 
:-) [11:11:17] done [11:11:21] thanks, urbanecm [11:11:26] np [11:11:35] Lucas_WMDE, you seem to be op on flags... [11:11:39] Lucas_WMDE: you do https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/irc/ircservserv-config/+/refs/heads/master/channels/wikimedia-operations.toml#20 [11:11:50] ah, ok [11:11:51] (Traffic bill over quota) firing: (4) Traffic bill over quota - https://alerts.wikimedia.org [11:11:51] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [11:11:55] then I don’t *know* how to do it ^^ [11:12:05] probably /topic but my client won’t let me copy the current topic easily [11:12:13] no worries [11:12:18] thank you both [11:12:29] Lucas_WMDE: you also first need to be op'ed, most clients then show some buttons to do it [11:14:07] hmm, my change isn’t doing what it’s supposed to be doing… [11:14:21] I’ll spend a bit more time looking into it before rolling back [11:14:33] since we have no other deployments in the calendar for this slot [11:16:51] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org [11:16:56] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/714343 (owner: 10L10n-bot) [11:17:15] (03PS5) 10RhinosF1: simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 [11:17:44] 10SRE, 10Infrastructure-Foundations, 10netops: Move management routers ssh port - https://phabricator.wikimedia.org/T277438 (10ayounsi) I changed the LibreNMS check to exclude management routers. To revert once this task is done: > Processor usage over 85% EXCEPT Management routers [11:17:47] (03CR) 10jerkins-bot: [V: 04-1] simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [11:18:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:53] (03CR) 10Jbond: mediabackups: Switch TLS certificates to PKI rather than puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710491 (https://phabricator.wikimedia.org/T222113) (owner: 10Jcrespo) [11:19:17] jbond: is ^ good to put on simplelap & racktables if worth it if I can make jerkins happy [11:20:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:03] (03CR) 10Jbond: [C: 03+1] "LGTM (once ci is happy)" [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [11:21:10] RhinosF1: yep lgtm thanks [11:21:26] jbond: ty [11:25:44] where do x-wikimedia-debug logs end up again? [11:25:56] https://wikitech.wikimedia.org/wiki/WikimediaDebug#Debug_logging says /srv/mw-log/XWikimediaDebug.log on mwlog1001 [11:26:01] but mwlog1001 no longer seems to exist [11:26:08] and mwlog1002 and mwlog2002 don’t have such a file [11:28:12] Lucas_WMDE: did you enable "Verbose log" in X-Wikimedia-Debug? [11:28:18] no [11:28:29] you need to do it to see the debug loggings [11:28:47] then it'll be in `/srv/mw-log/XWikimediaDebug.log` at mwlog2002 [11:28:52] aha, now the file appears [11:28:54] thanks [11:29:08] np. I made a request to verify that claim, fwiw. 
[11:29:47] ugh, and it turns out the thing I’m testing does work (mostly) ^^ [11:30:07] ok, so I’ll sync the change [11:31:51] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [11:31:58] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:714334|Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis (T251480)]] (1/2) (duration: 00m 58s) [11:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:02] T251480: Normalize pagenames/filenames on save in Wikibase - https://phabricator.wikimedia.org/T251480 [11:33:38] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:714334|Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis (T251480)]] (2/2) (duration: 00m 57s) [11:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:27] Lucas_WMDE: fwiw, normal (non-verbose non-debug) logs end up in whatever place they would end if X-Wikimedia-Debug was not used. There's also `/srv/mw-log/testwiki.log`, which has debug logs for all requests to testwikis; I'm however not sure which wiki you were testing on, and if it considers testwikidatawiki a testwiki. [11:34:43] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Migrate default nework policies (default-network-policy-conf.yaml) to GlobalNetworkPolicies - https://phabricator.wikimedia.org/T280125 (10akosiaris) 05Open→03Resolved a:03akosiaris This has been done. The Network policies have been migrated, the... 
[11:34:46] I was testing testwikidatawiki and it looked like the testwiki log only had testwiki requests [11:34:46] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10akosiaris) [11:35:09] !log EU backport+config window done [11:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:27] i see. Well, I hope you found the logs you needed to see Lucas_WMDE 🙂 [11:36:11] yes, eventually ^^ [11:37:09] (and I removed my debug code from mwdebug2001 now, with another scap pull) [11:46:14] Lucas_WMDE: thanks for clarifying the docs! [11:47:12] 10SRE, 10LDAP-Access-Requests: Enable CAS U2F for Majavah - https://phabricator.wikimedia.org/T289477 (10jcrespo) a:03jcrespo [11:49:32] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:52] 10SRE, 10LDAP-Access-Requests: Enable CAS U2F for Majavah - https://phabricator.wikimedia.org/T289477 (10jcrespo) I will be double check that your phab accounts matches your LDAP account and enabling it right away. Please bear with me as this is a new workflow for me. 
[11:51:08] 10SRE, 10LDAP-Access-Requests: Enable CAS U2F for Majavah - https://phabricator.wikimedia.org/T289477 (10jcrespo) p:05Triage→03High [11:52:58] (03PS1) 10Vgutierrez: envoyproxy: Fix use_remote_address config [puppet] - 10https://gerrit.wikimedia.org/r/714350 [11:54:25] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Fix use_remote_address config [puppet] - 10https://gerrit.wikimedia.org/r/714350 (owner: 10Vgutierrez) [11:54:38] (03CR) 10Jbond: "LGTM but see comments: -1 is for the v3 puppet function api (ping me on irc if you want me to expand)" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [11:54:43] (03CR) 10Jbond: [C: 04-1] Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [11:55:13] (03PS2) 10Vgutierrez: envoyproxy: Fix use_remote_address config [puppet] - 10https://gerrit.wikimedia.org/r/714350 [11:56:02] (03CR) 10Jbond: "havn't looked at this properly yet (still catching up) but pcc fails on puppetdb https://puppet-compiler.wmflabs.org/compiler1001/30775/" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [11:56:56] (03PS1) 10Jcrespo: mediabackup: Use file resource in lower case [puppet] - 10https://gerrit.wikimedia.org/r/714351 (https://phabricator.wikimedia.org/T276442) [11:57:23] (03CR) 10Jcrespo: "Will run puppet compiler to check noop." [puppet] - 10https://gerrit.wikimedia.org/r/714351 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:58:03] (03CR) 10Jcrespo: "Side note- is discovery a good CA for this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/714351 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:58:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/714350 (owner: 10Vgutierrez) [11:59:06] (03CR) 10H.krishna123: bernard: Add basic tox config (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [11:59:35] (03PS8) 10H.krishna123: bernard: Add basic tox config [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) [11:59:45] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [11:59:59] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/30776/" [puppet] - 10https://gerrit.wikimedia.org/r/714351 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [12:02:21] (03CR) 10Jcrespo: [C: 03+1] "Thanks for the update. Other repos may have the same issues copied and pasted :-)." [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [12:02:54] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:42] 10ops-eqiad: asw2-c-eqiad:ge-5/0/39 - wdqs1013 - Inbound interface errors - https://phabricator.wikimedia.org/T289483 (10ayounsi) [12:06:29] (03CR) 10H.krishna123: bernard: Add basic tox config (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [12:06:54] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10fgiunchedi) >>! In T288458#7300922, @jcrespo wrote: > Thanks, that's a super-useful graph I didn't know about. 
> >> it is easy enough to switch media backups to read from codfw (?) > > It is v... [12:08:23] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10Ladsgroup) A month have passed. If it continues to happen. I suggest banning the user. [12:12:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714351 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [12:19:57] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10Kormat) @jbond: Using the env var workaround for services works for the moment, so long as: - we're using upstream's binary a... [12:22:23] (03CR) 10Jbond: [C: 03+1] "LGTM (assuming pcc is happy)" [puppet] - 10https://gerrit.wikimedia.org/r/714351 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [12:25:52] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:03] (03PS1) 10ZPapierski: Test - verify connectivity issues for Flink with accept all [deployment-charts] - 10https://gerrit.wikimedia.org/r/714355 [12:43:39] (03CR) 10Jcrespo: [C: 03+2] "Thanks for all comments and review, and also for making it so fast. Your help and kind comments are much appreciated!" 
[puppet] - 10https://gerrit.wikimedia.org/r/714351 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [12:44:44] (03CR) 10Btullis: Improve the Kerberos automatic renewal service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [12:51:22] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: replace redis_lock eqiad servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714049 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [12:52:09] (03Merged) 10jenkins-bot: ProductionServices: replace redis_lock eqiad servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714049 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [12:55:10] !log jiji@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:713619|ProductionServices: change rdb* servers in eqiad and codfw (T280582)]] (duration: 00m 57s) [12:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:15] T280582: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 [12:57:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:30] (03PS1) 10MVernon: prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) [13:02:00] (03CR) 10jerkins-bot: [V: 04-1] prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:02:24] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:07] (03PS5) 10Vgutierrez: envoyproxy: Remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/710494 (https://phabricator.wikimedia.org/T265880) [13:06:09] (03PS2) 10ZPapierski: Test - verify connectivity issues for Flink with accept all [deployment-charts] - 10https://gerrit.wikimedia.org/r/714355 [13:09:42] (03CR) 10Kormat: "I think it might be possible to only modify one service: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Mapping%20of%2" [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:10:21] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/710494 (https://phabricator.wikimedia.org/T265880) (owner: 10Vgutierrez) [13:11:29] (03PS3) 10Vgutierrez: envoyproxy: Fix use_remote_address config [puppet] - 10https://gerrit.wikimedia.org/r/714350 [13:14:38] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Fix use_remote_address config [puppet] - 10https://gerrit.wikimedia.org/r/714350 (owner: 10Vgutierrez) [13:15:15] (03CR) 10Dzahn: "thank you for this! 
looks good, just a minor typo "strech" -> "stretch"" [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [13:17:36] (03PS6) 10RhinosF1: simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 [13:18:11] (03CR) 10jerkins-bot: [V: 04-1] simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [13:18:57] (03PS7) 10RhinosF1: simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 [13:19:36] mutante: should I add simplelap & racktables [13:19:40] (03PS2) 10MVernon: prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) [13:20:11] (03CR) 10jerkins-bot: [V: 04-1] prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:20:20] RhinosF1: not in this patch, just fix the typo please and I will merge [13:20:25] thanks for that [13:20:34] I was about to amend but let you do it then [13:20:59] (03PS3) 10MVernon: prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) [13:21:45] mutante: I done the typo & fixed Jenkins [13:21:47] (03CR) 10jerkins-bot: [V: 04-1] prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:21:54] I can send seperate if better [13:22:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:23:00] (03CR) 10Dzahn: [C: 03+2] simplelamp2: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713863 (owner: 10RhinosF1) [13:24:13] 
RhinosF1: cool, thanks:) merged. If you see the user who asked about it before the weekend.. feel free to tell them about it. they can now use bullseye [13:24:41] mutante: messaging them now over email [13:24:48] re: racktables: not sure if it will stick around until buster is out [13:24:53] perfect! [13:25:07] but I just took an old ticket to install it on miscweb2002 :p [13:25:20] (buster) [13:25:32] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:43] simplelap: sure, same fix as simplelamp2 seems good [13:25:51] (03CR) 10Nikerabbit: [C: 03+1] "We have stuff piling up: https://gerrit.wikimedia.org/r/q/owner:L10n-bot+status:open" [software/mailman-templates] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709976 (https://phabricator.wikimedia.org/T288027) (owner: 10Hashar) [13:25:58] as long as it doesnt change anything for existing users [13:26:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [13:26:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [13:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:24] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:28:18] (03PS5) 10Vgutierrez: envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) [13:28:20] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:30:24] mutante: https://gerrit.wikimedia.org/r/c/operations/puppet/+/714155 for simplelap [13:30:33] (03CR) 10jerkins-bot: [V: 04-1] simplelap: support bullseye same as with simplelamp2 [puppet] - 10https://gerrit.wikimedia.org/r/714155 (owner: 10RhinosF1) [13:30:35] Yeah wasn't sure with racktables whether it was worth the effort [13:30:35] (03CR) 10Dzahn: "I'd almost say let's merge this with the parent change into one so we can compile it." [puppet] - 10https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [13:31:11] (03PS3) 10RhinosF1: simplelap: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/714155 [13:32:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:32:21] (03PS4) 10Dzahn: simplelap: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/714155 (owner: 10RhinosF1) [13:33:03] (03CR) 10Dzahn: [C: 03+2] simplelap: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/714155 (owner: 10RhinosF1) [13:33:24] (03PS1) 10MVernon: base: note that service_unit is deprecated [puppet] - 10https://gerrit.wikimedia.org/r/714362 [13:33:45] (03PS6) 10Vgutierrez: envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) [13:34:14] RhinosF1: cheers, thx [13:34:42] mutante: np, let me know if racktables is cared about [13:35:46] 
(03CR) 10Dzahn: [C: 03+1] "looks good to me, nothing breaks: https://puppet-compiler.wmflabs.org/compiler1001/30777/" [puppet] - 10https://gerrit.wikimedia.org/r/714136 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [13:37:42] (03CR) 10Dzahn: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/compiler1002/30778/cumin2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [13:38:14] mutante: I found https://phabricator.wikimedia.org/T247646#5970607 from paravoid [13:38:30] Which seems to indicate buster support will still exist when racktables dies [13:39:54] (03CR) 10Vgutierrez: "NOOP on existing nodes using the V2 API: https://puppet-compiler.wmflabs.org/compiler1003/30779/" [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) (owner: 10Vgutierrez) [13:42:03] RhinosF1: good find, I had vague memories like that but you found the right comment. sure, let's add it there as well [13:42:52] mutante: add racktables for bullseye? [13:43:03] (03CR) 10MVernon: [C: 03+1] pontoon: add thanos settings to swift stack [puppet] - 10https://gerrit.wikimedia.org/r/714027 (owner: 10Filippo Giunchedi) [13:43:03] RhinosF1: yea [13:43:17] Doing them [13:43:19] Then [13:43:20] same code you used before [13:43:22] ty [13:43:22] (03CR) 10Kormat: "302 Jbond" [puppet] - 10https://gerrit.wikimedia.org/r/714362 (owner: 10MVernon) [13:43:36] (03CR) 10MVernon: [C: 03+1] pontoon: add swift settings to swift stack [puppet] - 10https://gerrit.wikimedia.org/r/714025 (owner: 10Filippo Giunchedi) [13:43:52] saw the ping, what's up? [13:44:10] (03CR) 10MVernon: [C: 03+1] pontoon: add update swift stack rolemap [puppet] - 10https://gerrit.wikimedia.org/r/714026 (owner: 10Filippo Giunchedi) [13:44:28] paravoid: nothing much, just a link to a comment by you about racktables on bullseye [13:44:36] bullseye? 
[13:44:39] (that we will keep it for 5 years) [13:44:40] buster you mean? [13:44:43] Yep doing [13:44:48] actually bullseye as well [13:44:59] I don't think I said that! [13:45:34] paravoid: too late, the court has ruled [13:45:45] yeah re-reading my phab comment, doesn't look like I did :) [13:45:47] (03PS1) 10ZPapierski: Allow inter task manager communication in flink [deployment-charts] - 10https://gerrit.wikimedia.org/r/714364 [13:45:49] you said you'd like to keep it for 5 years and buster deprecates in 2023 [13:46:14] ok, then we will just not merge it [13:46:14] (03PS2) 10ZPapierski: Allow inter task manager communication in flink [deployment-charts] - 10https://gerrit.wikimedia.org/r/714364 [13:46:21] keep racktables for 5 years since sept 2018 == kill it on sept 2023 [13:46:35] or move it ahead to jul 2023 or thereabouts, when we kill the last buster [13:46:37] alright [13:46:43] RhinosF1: we don't need it after all ^ [13:46:53] I mean, I don't particularly care tbh [13:47:01] whatever is the easiest thing for y'all :) [13:47:13] mutante: that's good because gerrit is still breaking with / charecter [13:47:23] It's slightly different what code search picked up [13:47:41] (03CR) 10MVernon: prometheus: couple mysqld exporter service to mariadb service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:47:42] https://gerrit.wikimedia.org/g/operations/puppet/+/98aba78a8347a367152abbc460faa2a6f73beefa/modules/racktables/manifests/init.pp#10 [13:47:54] paravoid: less work to not [13:47:57] RhinosF1: paravoid: Ok, let's not worry about it [13:48:22] (03PS2) 10Btullis: Improve creation of pkcs12 file by checking contents [puppet] - 10https://gerrit.wikimedia.org/r/709478 (https://phabricator.wikimedia.org/T287869) [13:51:34] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:53:08] (03CR) 10DCausse: [C: 03+2] Allow inter task manager communication in flink [deployment-charts] - 10https://gerrit.wikimedia.org/r/714364 (owner: 10ZPapierski) [13:55:33] (03Merged) 10jenkins-bot: Allow inter task manager communication in flink [deployment-charts] - 10https://gerrit.wikimedia.org/r/714364 (owner: 10ZPapierski) [13:56:25] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [13:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:48] (03CR) 10Btullis: "It's worth noting that in my testing with an-test-presto1001 we can see an effect of this bug." [puppet] - 10https://gerrit.wikimedia.org/r/709478 (https://phabricator.wikimedia.org/T287869) (owner: 10Btullis) [13:57:53] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [13:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:16] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:59:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:59:54] (03CR) 10MVernon: prometheus: couple mysqld exporter service to mariadb service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [14:00:29] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
- btullis@cumin1001 [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:41] (03PS7) 10Vgutierrez: envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) [14:02:18] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:38] (03PS1) 10Urbanecm: Growth features: Promote 9 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714366 (https://phabricator.wikimedia.org/T287871) [14:10:44] (03PS4) 10MVernon: prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) [14:11:35] (03CR) 10Kormat: prometheus: couple mysqld exporter service to mariadb service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [14:12:41] (03CR) 10MVernon: prometheus: couple mysqld exporter service to mariadb service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [14:14:49] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add swift settings to swift stack [puppet] - 10https://gerrit.wikimedia.org/r/714025 (owner: 10Filippo Giunchedi) [14:14:56] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add update swift stack rolemap [puppet] - 10https://gerrit.wikimedia.org/r/714026 (owner: 10Filippo Giunchedi) [14:15:04] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add thanos settings to swift stack [puppet] - 10https://gerrit.wikimedia.org/r/714027 (owner: 10Filippo Giunchedi) [14:17:25] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) (owner: 10Vgutierrez) [14:19:09] (03CR) 
10Vgutierrez: [C: 03+2] envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) (owner: 10Vgutierrez) [14:19:46] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) The offending task should have long been disabled. Is this still even happening? [14:26:45] (03PS6) 10Vgutierrez: envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) [14:26:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons. - btullis@cumin1001 [14:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:53] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:29:07] !log zpapierski@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [14:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:22] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) {F34617913} This is the crontab snippet. books.php was the offending program and is cle... 
[14:29:24] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:31:42] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Support ECDH curves configuration [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:33:59] (03PS1) 10Dzahn: miscweb: define a specific version tag for prod and staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/714368 (https://phabricator.wikimedia.org/T255148) [14:34:00] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.451e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [14:36:37] (03PS8) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [14:36:57] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) >>! In T269914#7178112, @Legoktm wrote: > @Cyberpower678 it looks much better now, so tha... 
[14:37:00] (03PS1) 10Btullis: Add six worker nodes that were staged into service [puppet] - 10https://gerrit.wikimedia.org/r/714369 (https://phabricator.wikimedia.org/T275767) [14:37:20] (03PS2) 10Dzahn: miscweb: define a specific version tag for prod and staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/714368 (https://phabricator.wikimedia.org/T255148) [14:37:58] (03CR) 10Dzahn: [C: 03+2] miscweb: define a specific version tag for prod and staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/714368 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [14:39:25] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30780/console" [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:40:48] (03Merged) 10jenkins-bot: miscweb: define a specific version tag for prod and staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/714368 (https://phabricator.wikimedia.org/T255148) (owner: 10Dzahn) [14:43:32] (03CR) 10Volans: [C: 03+1] "LGTM, minor nit in the tests. Feel free to +2 it without a second round of review once fixed." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558) (owner: 10RLazarus) [14:50:14] (03PS1) 10Urbanecm: ckbwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714371 (https://phabricator.wikimedia.org/T287867) [14:50:56] (03CR) 10Urbanecm: [C: 03+2] ckbwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714371 (https://phabricator.wikimedia.org/T287867) (owner: 10Urbanecm) [14:51:42] (03Merged) 10jenkins-bot: ckbwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714371 (https://phabricator.wikimedia.org/T287867) (owner: 10Urbanecm) [14:53:16] !log [urbanecm@mwmaint2002 /srv/mediawiki-staging/php]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=ckbwiki growthexperiments # T287867 [14:53:19] 10SRE, 10Proton, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10akosiaris) >>! In T277857#7288616, @MSantos wrote: > @Jgiannelos the board looks good for me. I think we should change the primary one with these changes... 
[14:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:23] T287867: Deploy Growth features on Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T287867 [14:53:30] 10SRE, 10Proton, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10akosiaris) 05Open→03Resolved [14:53:31] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q1): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10akosiaris) [14:54:08] !log [urbanecm@mwmaint2002 /srv/mediawiki-staging/php]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=ckbwiki --phab=T287867 # T287867 [14:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:35] (03Abandoned) 10Vgutierrez: envoyproxy: Add upstream PROXY protocol support [puppet] - 10https://gerrit.wikimedia.org/r/711386 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [14:56:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:23] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 26fe6d7a380d4a798f78abf0e722e36c5c63df80: ckbwiki: Enable Growth features in dark mode (T287867; 1/3) (duration: 00m 57s) [14:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:54] (03PS1) 10Filippo Giunchedi: o11y: port thanos-rule alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/714372 (https://phabricator.wikimedia.org/T288726) [14:58:20] !log urbanecm@deploy1002 Synchronized wmf-config/config/ckbwiki.yaml: 26fe6d7a380d4a798f78abf0e722e36c5c63df80: ckbwiki: Enable Growth features in dark mode (T287867; 2/3) (duration: 00m 57s) [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:26] T287867: Deploy Growth features on Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T287867 [14:59:00] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:59:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 26fe6d7a380d4a798f78abf0e722e36c5c63df80: ckbwiki: Enable Growth features in dark mode (T287867; 3/3) (duration: 00m 56s) [14:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:21] * urbanecm done [14:59:22] (03PS1) 10Filippo Giunchedi: profile: remove thanos-compact alerts, ported to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714373 (https://phabricator.wikimedia.org/T288726) [14:59:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:54] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [15:03:58] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Add STEK configuration support [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:07:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709478 (https://phabricator.wikimedia.org/T287869) (owner: 10Btullis) [15:07:24] (03PS1) 10Herron: logstash: route alertmanager alerts to logstash alerts index [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) [15:07:43] (03PS2) 10Urbanecm: Growth features: Promote 9 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714366 (https://phabricator.wikimedia.org/T287871) [15:11:25] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10MPhamWMF) p:05Triage→03High [15:12:02] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Gehel) [15:12:22] 10SRE, 10Discovery-Search: Migrate Elasticsearch to Debian Buster - https://phabricator.wikimedia.org/T244736 (10Gehel) 05Open→03Declined superseeded by T289135 [15:13:16] (03PS1) 10Volans: reports/puppetdb: support WMF standard configs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/714377 (https://phabricator.wikimedia.org/T284614) [15:18:27] (03CR) 10Bstorm: [C: 04-1] "Port grabber is what allows gridengine webservices to work. 
It's tricky to find where it is used, but basically when you start a service o" [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [15:19:26] (03PS2) 10Jbond: base: note that service_unit is deprecated [puppet] - 10https://gerrit.wikimedia.org/r/714362 (owner: 10MVernon) [15:20:21] (03CR) 10Jbond: [C: 03+1] "lgtm (made a minor edit)" [puppet] - 10https://gerrit.wikimedia.org/r/714362 (owner: 10MVernon) [15:22:28] (03CR) 10Bstorm: "Basically, I highly suspect that, while it is duplicated in places in webservice, it is still used and needed. I tried to remove it once f" [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [15:24:11] (03CR) 10MVernon: [C: 03+2] base: note that service_unit is deprecated [puppet] - 10https://gerrit.wikimedia.org/r/714362 (owner: 10MVernon) [15:25:33] (03PS9) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [15:25:35] (03PS7) 10Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) [15:25:37] (03PS6) 10Vgutierrez: envoyproxy: Support ECDH curves configuration [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) [15:25:39] (03PS6) 10Vgutierrez: envoyproxy: Add STEK configuration support [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) [15:25:41] (03PS6) 10Vgutierrez: cache: Provide an envoy STEK manager script [puppet] - 10https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) [15:25:43] (03PS7) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) [15:25:48] (03PS8) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 
(https://phabricator.wikimedia.org/T271421) [15:25:52] (03PS8) 10Vgutierrez: envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [15:25:56] (03PS7) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [15:26:00] (03PS7) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [15:26:04] (03PS7) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [15:26:08] (03PS2) 10Vgutierrez: envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) [15:26:12] (03PS1) 10Vgutierrez: envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) [15:26:16] (03PS1) 10Vgutierrez: envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) [15:26:20] (03PS1) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) [15:26:30] (03CR) 10Majavah: toolforge: remove portgrabber (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [15:26:42] (03CR) 10Volans: "I might be late for the party but I'd like to challenge the choice of the cumin hosts for testing purposes. 
Is there any reason to not use" [puppet] - 10https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [15:27:15] (03CR) 10Jbond: [C: 03+1] "lgtm optional nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [15:34:13] (03PS2) 10Platonides: Add Wikimedia ES to $wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714211 (https://phabricator.wikimedia.org/T289446) [15:34:59] (03PS2) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) [15:38:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:38:51] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:30] (03PS1) 10Jbond: gitlab cas: update uid field to use uid not CN [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) [15:39:56] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-dsaez-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:10] (03CR) 10Jbond: [C: 04-1] "-1 as this could be undesirable" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: 10Jbond) [15:40:49] hmmm [15:41:52] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox 
(exit_code=99) [15:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:26] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:00] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:45:39] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [15:46:27] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [15:48:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:43] (03CR) 10Filippo Giunchedi: "Thank you for the quick action on this, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) (owner: 10Herron) [15:57:14] 10SRE, 10LDAP-Access-Requests: Enable CAS U2F for Majavah - https://phabricator.wikimedia.org/T289477 (10jcrespo) 05Open→03Resolved Sorry to make you wait- I was making sure I was familiar with the different LDAP fields, double checking your identity matched the request and validating your email on LDA mat... 
[15:58:29] (03PS2) 10Herron: logstash: route alertmanager alerts to logstash alerts index [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) [15:59:50] (03CR) 10Herron: logstash: route alertmanager alerts to logstash alerts index (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) (owner: 10Herron) [16:02:30] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2027.codfw.wmnet'... [16:05:00] (03Abandoned) 10DCausse: Test - verify connectivity issues for Flink with accept all [deployment-charts] - 10https://gerrit.wikimedia.org/r/714355 (owner: 10ZPapierski) [16:05:06] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [16:05:12] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10Legoktm) 05Open→03Resolved I don't see any incidents of this in the last 30 days, happy to call this... [16:05:40] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Papaul) ` papaul@fasw-c-codfw# run show interfaces ge-0/0/5 terse Interface Admin Link Proto Local Remote ge-0/0... 
[16:07:08] 10SRE, 10 Data-Engineering, 10Analytics, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10odimitrijevic) p:05Triage→03High [16:08:09] 10SRE, 10LDAP-Access-Requests: Enable CAS U2F for Majavah - https://phabricator.wikimedia.org/T289477 (10Majavah) Seems to work. Thanks! [16:10:00] (03CR) 10Dzahn: [C: 03+1] "This is just about where the crons run that connect to other servers. The actual test will be performed against canaries but it needs to c" [puppet] - 10https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [16:10:53] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10thcipriani) [16:11:25] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10thcipriani) [16:13:51] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10thcipriani) >>! In T209149#729... 
[16:20:37] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10thcipriani) Adding @dduvall per triage meeting [16:21:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2027.codfw.wmnet with reason: REIMAGE [16:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:10] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc2027.codfw.wmnet with reason: REIMAGE [16:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:54] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10thcipriani) 05Open→03Resolved >>! In T287570#7251342, @Joe wrote: > The code should now be deployed when merged/... [16:28:58] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10thcipriani) [16:32:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata/hosts.pp: eqiad memcached refresh cleanup [puppet] - 10https://gerrit.wikimedia.org/r/714050 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [16:32:24] PROBLEM - Host mc2027 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:52] RECOVERY - Host mc2027 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [16:35:25] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2027.codfw.wmnet'] ` and were **ALL** successful. 
[16:37:18] !log ebernhardson@deploy1002 Started deploy [search/airflow@32f5039]: Add pyarrow lib for hdfs integration [16:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:54] !log ebernhardson@deploy1002 Finished deploy [search/airflow@32f5039]: Add pyarrow lib for hdfs integration (duration: 00m 35s) [16:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:46] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Papaul) [16:51:52] (03CR) 10Effie Mouzeli: "pcc https://puppet-compiler.wmflabs.org/compiler1001/30783/" [puppet] - 10https://gerrit.wikimedia.org/r/714050 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [16:54:15] (03PS1) 10Papaul: DNS: Add DNS entries for frdb2003 [dns] - 10https://gerrit.wikimedia.org/r/714384 [17:00:00] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10BTullis) @herron - I have a quick question. In the task description it says: > Temporary fork role::kafka::monitoring to role::kafka::... [17:00:05] ryankemper: #bothumor I ļæ½ Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T1700). 
[17:02:14] (03CR) 10Papaul: [C: 03+2] DNS: Add DNS entries for frdb2003 [dns] - 10https://gerrit.wikimedia.org/r/714384 (owner: 10Papaul) [17:05:56] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Papaul) [17:06:32] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Papaul) 05Open→03Resolved @Jgreen this is complete on my end . Let me know if you have any issues. [17:12:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:13:29] (03PS1) 10Effie Mouzeli: mcrouter: add option to remove certificatre expiring alerts [puppet] - 10https://gerrit.wikimedia.org/r/714386 [17:14:35] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: add option to remove certificatre expiring alerts [puppet] - 10https://gerrit.wikimedia.org/r/714386 (owner: 10Effie Mouzeli) [17:16:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:16:50] !log ebernhardson@deploy1002 Started deploy [search/airflow@4c49df7]: ship modern pip/wheel version to support manylinux2014 (pyarrow) [17:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:08] (03PS5) 10Ryan Kemper: Elasticsearch cookbooks: Represent ops as enum [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: 10Gehel) [17:17:14] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P 
interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:17:46] !log ebernhardson@deploy1002 Finished deploy [search/airflow@4c49df7]: ship modern pip/wheel version to support manylinux2014 (pyarrow) (duration: 00m 56s) [17:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:55] 10SRE, 10ops-codfw: Move YubiHSM from auth2001 to pki2001 - https://phabricator.wikimedia.org/T281459 (10Papaul) 05Openā†’03Resolved a:03Papaul @MoritzMuehlenhoff complete [17:43:36] (03PS1) 10Ebernhardson: airflow: Provide hadoop environment defaults [puppet] - 10https://gerrit.wikimedia.org/r/714391 [17:50:41] (03PS2) 10Ebernhardson: airflow: Provide hadoop environment defaults [puppet] - 10https://gerrit.wikimedia.org/r/714391 [17:52:25] (03PS4) 10Krinkle: tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 [17:52:34] (03CR) 10Krinkle: [C: 03+2] tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 (owner: 10Krinkle) [17:53:18] (03Merged) 10jenkins-bot: tests: Improve testCrossDcCompatibility to catch mismatching types [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713911 (owner: 10Krinkle) [17:57:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I ļæ½ Unicode. All rise for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T1800). [18:00:04] dancy: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:49] * urbanecm is around, but i guess dancy will self-service? [18:01:03] Yeah, I'll take care of my commits. [18:01:09] great! [18:01:43] starting now. [18:02:15] (03CR) 10Ahmon Dancy: [C: 03+2] Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [18:04:07] (03PS1) 10Gergő Tisza: Add Link: store when tasks were generated [extensions/GrowthExperiments] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714158 (https://phabricator.wikimedia.org/T284551) [18:05:03] 10SRE, 10 Data-Engineering, 10Analytics, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10mforns) a:03Mholloway Just for the record, some discussion has happened in https://gerrit.wikimedia.org/r/c/m... [18:05:11] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Jdforrester-WMF) [18:05:13] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Jdforrester-WMF) [18:06:05] added another patch. I can deploy it. [18:06:49] 👍🏾 I'll let you know when I'm done. [18:07:15] I have 3 patches but you're welcome to slip in between any of them if you're time-constrained. 
[18:08:15] (03Merged) 10jenkins-bot: Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [18:11:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:00] !log dancy@deploy1002 Synchronized wmf-config/etcd.php: Config: [[gerrit:713704|Allow protocol for etcd server to be specified]] (duration: 00m 57s) [18:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:34] (03CR) 10Ahmon Dancy: [C: 03+2] Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [18:12:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:16] (03CR) 10RLazarus: [C: 03+2] mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [18:16:41] (03CR) 10RLazarus: [C: 03+2] httpbb: Add hourly test runs via systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/714136 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [18:18:26] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:19:01] (03PS3) 10Ahmon Dancy: Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 [18:19:03] (03PS5) 10Ahmon Dancy: wmfSetupEtcd only supports array input [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 [18:19:23] (03CR) 10Ahmon Dancy: Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [18:19:27] (03CR) 10Ahmon Dancy: [C: 03+2] Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [18:20:40] (03Merged) 10jenkins-bot: Use array format to specify etcd server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713906 (owner: 10Ahmon Dancy) [18:22:09] (03Merged) 10jenkins-bot: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [18:23:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:30] (03CR) 10Ahmon Dancy: [C: 03+2] wmfSetupEtcd only supports array input [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 (owner: 10Ahmon Dancy) [18:23:39] (03PS1) 10Herron: cumin: update kafka::monitoring alias [puppet] - 10https://gerrit.wikimedia.org/r/714398 (https://phabricator.wikimedia.org/T252773) [18:23:50] !log dancy@deploy1002 Synchronized wmf-config: Config: [[gerrit:713906|Use array format to specify etcd server]] (duration: 00m 57s) [18:23:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:17] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) I opted to remove `role::kafka::monitoring` in favor of `role::kafka::monitoring_buster` so the config woul... [18:24:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:57] (03Merged) 10jenkins-bot: wmfSetupEtcd only supports array input [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713907 (owner: 10Ahmon Dancy) [18:27:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:32] !log dancy@deploy1002 Synchronized wmf-config/etcd.php: Config: [[gerrit:713907|wmfSetupEtcd only supports array input]] (duration: 00m 57s) [18:27:35] tgr: I am done. [18:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:43] thanks! [18:28:01] (03CR) 10Gergő Tisza: [C: 03+2] Add Link: store when tasks were generated [extensions/GrowthExperiments] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714158 (https://phabricator.wikimedia.org/T284551) (owner: 10Gergő Tisza) [18:34:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:03] (03PS3) 10RLazarus: icinga: Use shlex to quote the command string for bash -c. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558) [18:35:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:55] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10dancy) Hi @kostajh. I'd like... [18:38:52] I'm curious if some folks could try to reproduce this WikimediaDebug issue? https://phabricator.wikimedia.org/T289246 [18:39:05] (03PS2) 10Effie Mouzeli: mcrouter: add option to remove certificatre expiring alerts [puppet] - 10https://gerrit.wikimedia.org/r/714386 [18:39:14] Ping me if you you tried and did/didn't get the glitch. [18:40:09] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) @Joe As of `docker-registry.wikimedia.org/restricted/mediawiki-webserver:2021-08-04-134912-webserver` it lo... [18:41:57] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) [18:42:05] (03PS4) 10RLazarus: icinga: Use shlex to quote the command string for bash -c. [software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558) [18:42:51] Krinkle: I use firefox as my main browser, and I'm able to use the extension. HOWEVER, ocasionally, when deploying, it just insists on mwdebug1001 being the correct server. When that happens, I only need to turn it off and on again, and it starts to work. 
[18:44:43] (03Merged) 10jenkins-bot: Add Link: store when tasks were generated [extensions/GrowthExperiments] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714158 (https://phabricator.wikimedia.org/T284551) (owner: 10Gergő Tisza) [18:45:28] Krinkle: Using Version 92.0.4515.107 (Official Build) (64-bit).. no problems switching servers. [18:45:34] +Chrome [18:46:16] (03CR) 10RLazarus: [C: 03+2] icinga: Use shlex to quote the command string for bash -c. (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558) (owner: 10RLazarus) [18:47:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:43] (03Merged) 10jenkins-bot: icinga: Use shlex to quote the command string for bash -c. [software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558) (owner: 10RLazarus) [18:54:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack downtime methods fail when the admin reason includes an apostrophe - https://phabricator.wikimedia.org/T288558 (10RLazarus) Leaving this open until the next version of Spicerack is released with the fix, then we... 
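Note on T288558: the task title says downtime calls failed when the admin reason contained an apostrophe, and the patch title says the fix quotes the command string for `bash -c` with shlex. The snippet below is only a minimal sketch of that general technique using Python's standard `shlex.quote`; it is not the Spicerack patch itself, and the `downtime --reason` command and the `wrap_for_bash` helper are hypothetical names made up for illustration.

```python
import shlex


def wrap_for_bash(command: str) -> str:
    """Return a 'bash -c ...' string that passes `command` as one safe argument."""
    # shlex.quote escapes embedded single quotes, so an apostrophe in
    # `command` cannot terminate the quoting early.
    return "bash -c " + shlex.quote(command)


# Hypothetical inner command; the reason text contains an apostrophe.
reason = "host can't be reached"
inner = "downtime --reason " + shlex.quote(reason)
outer = wrap_for_bash(inner)
```

Splitting `outer` with `shlex.split` gives back `bash`, `-c` and the inner command as a single argument, which is the property a naive `"bash -c '%s'" % command` loses as soon as the text contains an apostrophe.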
[18:56:12] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/GrowthExperiments: Backport: [[gerrit:714158|Add Link: store when tasks were generated (T284551)]] (duration: 00m 57s) [18:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:19] T284551: Maintenance script for updating recommendations to newer dataset - https://phabricator.wikimedia.org/T284551 [18:56:44] !log morning deploys done [18:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:19] Krinkle: also had no issues on Chrome 92 / Ubuntu [18:58:17] (I use the light theme) [18:58:55] thanks all [18:59:25] (03CR) 10Ryan Kemper: [C: 03+2] airflow: Provide hadoop environment defaults [puppet] - 10https://gerrit.wikimedia.org/r/714391 (owner: 10Ebernhardson) [19:00:59] (03CR) 10Jbond: "> Patch Set 17: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [19:05:31] (03PS4) 10Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson) [19:13:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:16:39] (03CR) 10RLazarus: hieradata: Run httpbb hourly from cumin2001 against a codfw appserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714137 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [19:17:56] (03PS1) 10Michael DiPietro: remove lvm requirement [puppet] - 10https://gerrit.wikimedia.org/r/714405 [19:20:32] (03CR) 10Andrew Bogott: "This should be fine -- we aren't planning to build any new hosts with lvm and existing ones both already have it set up and also have the " [puppet] - 10https://gerrit.wikimedia.org/r/714405 (owner: 10Michael DiPietro) [19:20:40] (03CR) 10Andrew Bogott: [C: 03+1] 
remove lvm requirement [puppet] - 10https://gerrit.wikimedia.org/r/714405 (owner: 10Michael DiPietro) [19:20:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:22:37] legoktm: do you use a dark theme? [19:22:59] If so, it seems the WD bug might be specific to Linux with dark mode. [19:23:57] (03CR) 10Michael DiPietro: [C: 03+2] remove lvm requirement [puppet] - 10https://gerrit.wikimedia.org/r/714405 (owner: 10Michael DiPietro) [19:28:18] (03PS3) 10Herron: logstash: route alertmanager alerts to logstash alerts index [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) [19:28:49] (03CR) 10Herron: logstash: route alertmanager alerts to logstash alerts index (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714374 (https://phabricator.wikimedia.org/T289356) (owner: 10Herron) [19:42:17] (03PS1) 10Michael DiPietro: remove lvm [puppet] - 10https://gerrit.wikimedia.org/r/714409 [19:49:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:52:36] (03CR) 10Bstorm: toolforge: remove portgrabber (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [19:57:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:59:23] (03PS1) 10Andrew Bogott: disable_tool: include $projectname in the config file [puppet] - 10https://gerrit.wikimedia.org/r/714412 [20:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. 
Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T2000). [20:00:46] (03CR) 10Andrew Bogott: [C: 03+1] remove lvm [puppet] - 10https://gerrit.wikimedia.org/r/714409 (owner: 10Michael DiPietro) [20:02:39] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) >>! In T209149#730274... [20:04:22] (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: include $projectname in the config file [puppet] - 10https://gerrit.wikimedia.org/r/714412 (owner: 10Andrew Bogott) [20:06:41] (03CR) 10Bstorm: toolforge: remove portgrabber (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [20:08:55] (03PS1) 10Ahmon Dancy: profile::releases:common: Install emacs on releases servers [puppet] - 10https://gerrit.wikimedia.org/r/714414 [20:11:56] (03CR) 10Ahmon Dancy: [C: 03+1] "Please merge. 😊" [puppet] - 10https://gerrit.wikimedia.org/r/714414 (owner: 10Ahmon Dancy) [20:17:25] (03CR) 10Herron: [C: 03+2] cumin: update kafka::monitoring alias [puppet] - 10https://gerrit.wikimedia.org/r/714398 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [20:19:13] (03CR) 10Bstorm: "The more I'm reading old patches and stuff, the more I'm thinking I might agree that "nobody is actually using this now". I'm adding Bryan" [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [20:22:12] "legoktm: do you use a dark theme..." 
<- I do, yes [20:33:48] (03CR) 10Herron: [C: 03+1] o11y: port thanos-rule alerts from Icinga (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/714372 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [20:34:57] (03CR) 10Herron: [C: 03+1] profile: remove thanos-compact alerts, ported to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714373 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [20:54:39] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30788/console" [puppet] - 10https://gerrit.wikimedia.org/r/713502 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [20:56:02] (03CR) 10RLazarus: [V: 03+1 C: 03+2] mediawiki: Absent logrotate for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713502 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [20:56:21] (03PS4) 10RLazarus: mediawiki: Absent logrotate for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713502 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [20:57:33] hmm [21:00:04] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T2100). 
[21:01:37] (03PS1) 10Legoktm: debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714418 (https://phabricator.wikimedia.org/T289246) [21:13:57] xD [21:24:47] (03CR) 10BryanDavis: toolforge: remove portgrabber (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [21:25:05] 10SRE, 10Traffic: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ssingh) [21:25:26] 10SRE, 10Traffic: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ssingh) [21:25:29] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [21:25:36] 10SRE, 10Traffic: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ssingh) p:05Triage→03High [21:26:33] 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10dancy) [21:29:01] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10dancy) I'll work with my team... [21:36:07] (03PS1) 10Legoktm: services_proxy: Add mwapi envoyproxy for MediaWiki-internal requests [puppet] - 10https://gerrit.wikimedia.org/r/714420 (https://phabricator.wikimedia.org/T288848) [21:41:57] https://codesearch.wmcloud.org/operations/?q=proxy&i=fosho&files=&excludeFiles=&repos=Wikimedia%20MediaWiki%20config <-- I wonder why we don't set a plain $wgHTTPProxy and instead configure it per-extension/service [21:54:17] legoktm: I think it varies. The 2-3 cases I spot checked last week when we talked about it did leverage the default (e.g. Flickr API, Machine translation, ..) 
[21:54:51] MachineVision and MediaModeration are fairly new, might have unintentionally established a new pattern. [21:54:59] Although I see RSS and TorBlock did so as well, unclear why. [21:55:42] I figured it out mostly, it's so actually internal requests like shellbox.discovery.wmnet work [21:56:01] except we don't do that anymore, everything goes through envoy [21:57:35] legoktm: I'm not sure I follow. Do you mean that for RSS, the fact that it sets the (same) urldownloader proxy through a method parameter (used to) make it bypass that proxy parameter if the url was an internal hostname or IP? [21:59:45] so there's no global proxy set, $wgHTTPProxy is empty string. If you want to make an external HTTP request, you have to manually tell MWHttpRequest to use a proxy (RSS, TorBlock, etc.) I assume the reason we didn't set a global proxy is so that internal requests to *.wmnet, etc. wouldn't get caught up in it. Of course, I don't understand why *.wmnet wasn't added to $wgLocalVirtualHosts [22:02:34] We have 3 generic categories of requests now, 1) external, need to go over urldownloader. 2) internal, already using envoy (mostly services) 3) internal, hitting MW itself - will use $wgLocalHTTPProxy [22:10:35] legoktm: hm.. I'm still not quite getting how envoy relates to the conversation. [22:11:01] for category 2, what does use or disuse of envoy change? [22:11:10] it'd be foow.mnet vs localhost:1234 [22:11:38] I think adding wmnet to LocalVirtualHosts is a problem. but I'l get to that in a bit. [22:11:48] nothing, just that conceptually all of our outbound HTTP requests are going over some form of proxy now [22:11:55] ah okay [22:12:30] so the fact that we use internal HTTP services indeed explains why we don't set wgHTTPProxy in general and instead opt-in locally for known-external use cases. [22:13:54] yeah [22:14:11] so we could either A) keep opting in locally but do so by passing some shared config variable (is that what you're proposing?) 
one that is in core but somehow isn't wgHTTPProxy. or B) decide how to opt-out wmnet from wgHTTPProxy in a way that doesn't involve wgLocalVirtualHosts [22:15:02] So the reason for wgLocalVirtualHosts is because afaik those are meant to Vhosts on the same exact server as MW (localhost on same port), generally very closely correlated with sitematrix I think. [22:15:23] B is what I had in mind [22:15:26] I think we used to rely on that in some cases [22:16:29] specifically thinking about this comment as well which seems to support that understanding historically: [22:16:41] > wmf-config:// Do not add wikimedia.org, because of other sites under that domain (such as codereview-proxy.wikimedia.org) [22:16:47] In theory, we should not be making requests directly to *.wmnet anymore. I think if we taught MWHttpRequest to use no proxy for "localhost" that might just be enough [22:16:53] presumablu it would have been fine to reach that service when it existed without squid. [22:17:03] hm, yeah [22:17:29] also, if LocalVirtualHosts only mattered to HTTP Req (it doesn't), we wouldn't be setting it at all in prod [22:17:36] since we don't have wgHTTPProxy enabled [22:17:44] heh, yeah [22:17:51] https://codesearch.wmcloud.org/search/?q=LocalVirtualHosts&i=nope&files=&excludeFiles=&repos= [22:17:55] having said all that.. [22:18:18] (03PS1) 10Andrew Bogott: Added wmcs-novastats-cephleaks script [puppet] - 10https://gerrit.wikimedia.org/r/714427 (https://phabricator.wikimedia.org/T289502) [22:18:26] (03CR) 10Bstorm: "I'm running a grep -r on the toolforge nfs. I think if "portmapper" fails to find a match, we might not even need an announcement? I tried" [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [22:18:35] there is literally not one use of it variable apart from that one specific use in MWHttpReq [22:19:26] I think it used to be used in WikiMap.php [22:19:32] trying to figure out what happened [22:19:34] oh, when it was part of $wgConf? 
[22:19:51] yeah, but you'd expect more use of wgLocalVirtualHosts if that was the case [22:19:54] since it moved /to/ wgLocalVirtualHosts [22:20:09] (03CR) 10Andrew Bogott: [C: 03+2] Added wmcs-novastats-cephleaks script [puppet] - 10https://gerrit.wikimedia.org/r/714427 (https://phabricator.wikimedia.org/T289502) (owner: 10Andrew Bogott) [22:20:25] https://github.com/wikimedia/mediawiki/commit/b951485c24b2c8673cfabf395e6a2937a7f81331 [22:20:45] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23), 10Patch-For-Review: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) My deployment plan is: * Turn on envoy proxy nowish, test various requests with curl manually * En... [22:21:25] https://github.com/wikimedia/mediawiki/commit/15cb57b3b5c5b665f4602baccfc320701b3ef6aa [22:22:12] So... the one use of it that I'm aware of (ChronologyProtector, about a multi-dc use case that never went beyond theoretical afaik), actually switched back to wgConf in a different way [22:22:22] by using wgConf getAll('wgCanonicalServer') [22:22:24] (03PS1) 10Andrew Bogott: Move wmcs-novastats-cephleaks.py to the place puppet looks for it. [puppet] - 10https://gerrit.wikimedia.org/r/714428 [22:23:08] that code might be remove-able depending on how we end up figuring out the last details of multi-dc [22:23:31] but either way, seems like wgLocalVirtualHosts could just be renamed to something like wgHttpNoProxyDomains or some such [22:23:39] (03CR) 10Andrew Bogott: [C: 03+2] Move wmcs-novastats-cephleaks.py to the place puppet looks for it. [puppet] - 10https://gerrit.wikimedia.org/r/714428 (owner: 10Andrew Bogott) [22:24:02] and perhaps include a constant 'localhost' in its consumer at run-time. [22:24:22] well now it uses $wgLocalHTTPProxy [22:24:39] right, so it's not noProxy.. [22:24:40] maybe we need some kind of generic domain => proxy map instead [22:25:06] I think we can postpone that additional abstraction. 
[22:25:19] oh, yeah [22:25:28] what we have for now is fine for mw-on-k8s I think [22:25:30] wgHttpLocalDomains might be a better name [22:26:22] if you enable a default proxy, all external domains go through it. local domains is empty by default but you can set it to stuff relating to things you've configured that you don't want proxied. [22:26:38] and individual call sites can also continue to opt-out of any proxy if they don't want it for another erason (noProxy=true). [22:27:01] I think that covers everything one could possibly want, and surely someone a decade from now can tell me if I turn out to be wrong. [22:30:05] legoktm: OK, i've now managed to un-understand why we need(ed) wgLocalHTTPProxy [22:30:19] haha [22:30:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:30:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:30:41] Your commit message says it is because k8s blocks outgoing traffic even if it's LAN. [22:30:50] But that's a good thing. [22:31:03] requests to "https://meta.wikimedia.org/" or "https://commons.wikimedia.org/" need to be routed via envoy [22:31:15] so now they'll have a proxy set to localhost:6501 [22:31:15] why would we want to wildcard allow it by forcing it on a generic proxy set up for this purpose instead of figuring out what's violating it and giving it an envoy port. [22:31:35] ah, right, we're back to the original use of LocalVHosts. 
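Note on the proxy rules discussed above: the [22:02:34] message lists three kinds of outbound request (external, internal services already on envoy, and MediaWiki calling itself), and the [22:26:22] to [22:26:38] messages propose a default proxy plus a list of local domains plus a per-call opt-out. The snippet below is only a rough sketch of how such a decision could be written down; it is not MediaWiki's `MWHttpRequest` code. `pick_proxy`, its parameter names, the suffix-matching rule and the `urldownloader.example` address are all assumptions made up for illustration; only `localhost:6501` and the wiki hostnames come from the log.

```python
from typing import Sequence
from urllib.parse import urlsplit


def _matches(host: str, domains: Sequence[str]) -> bool:
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def pick_proxy(
    url: str,
    default_proxy: str = "",
    local_proxy: str = "",
    local_virtual_hosts: Sequence[str] = (),
    no_proxy_domains: Sequence[str] = (),
    no_proxy: bool = False,
) -> str:
    """Return the proxy to use for url, or "" for a direct connection."""
    host = (urlsplit(url).hostname or "").lower()
    if no_proxy or host == "localhost":
        # Per-call opt-out, or an envoy listener that is already local.
        return ""
    if _matches(host, local_virtual_hosts):
        # Requests to the wiki farm itself go through the local proxy.
        return local_proxy
    if _matches(host, no_proxy_domains):
        # Internal, non-wiki hosts that must not use the default proxy.
        return ""
    # Everything else is treated as external.
    return default_proxy


# Hypothetical settings for illustration only.
settings = dict(
    default_proxy="http://urldownloader.example:8080",
    local_proxy="http://localhost:6501",
    local_virtual_hosts=["meta.wikimedia.org", "commons.wikimedia.org"],
    no_proxy_domains=["wmnet"],
)
```

With these settings, a request to `https://meta.wikimedia.org/` would be sent via `localhost:6501`, a request to an external API via the default proxy, and a request to a `*.wmnet` host or a `localhost` port would bypass both.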
[22:31:48] mostly so we don't have to hunt down each thing individually [22:32:05] which from a functional perspective doesn't need to bypass the proxy for extenral requests [22:32:21] but we want to anyway for performance and reliability (different rate limits etc.) [22:33:00] right, and we can't actually configure the things using wiki domains to an envoy port since it's not really configuration. [22:33:28] I imagine there's numerous use cases that probalby can't always be set somewhere to a localhost address. [22:33:38] okay, good. so we still need that. [22:35:05] and then to be able to have a default proxy for external domains you need to opt-out internal non-wikifarm domains [22:35:18] I think we'd want that to be a noproxy treatment though, not a localproxy treatment. [22:35:23] Both should be fine, right? [22:35:58] yes [22:36:07] I don't think we want random wmnet things, especially if unintentional, to get out via the generic envoy proxy. That's assuming it would even work, which it probably wouldn't. [22:36:13] right [22:36:37] e.g. if this generic port goes back to the same or another mw pod (as it should imho) [22:36:37] and in theory all the "internal non-wikifarm domains" should already be using "http://localhost:$PORT/" envoy proxies. [22:36:49] exactly [22:37:07] So before we can enable a default http proxy we need to add a wg var for noProxy default? [22:37:51] while we won't use wmnet anymore, I think it's reasonable for third parties to configure some features to connect with an internal service that isn't on a localhost port. 
[22:38:02] yeah [22:38:23] that might explain why we haven't set a defaul tproxy and by extent why perhaps nobody else has [22:38:36] we could check with Fandom to ask what they do [22:38:37] * Krinkle does so [22:38:46] it's on the ticket [22:39:05] https://phabricator.wikimedia.org/T288848#7292923 [22:39:15] "we ended up creating an HttpRequestFactory service override to dynamically use the current k8s service as proxy for request URLs known to belong to wikis on our platform" [22:39:16] Ah, I see. [22:39:24] I read that but didn't internalise it properly [22:39:34] I was wondering why they didn't just use LocalVHost [22:39:36] which is basically $wgLocalHTTPProxy without modifying core [22:39:42] right [22:39:54] I still wonder if they use a default http proxy though [22:41:04] not sure about that [22:49:04] legoktm: ok, so in light of your plan and the above convo, I think LocalVHosts is well-named as is, since LocalProxy wouldn't be a generic go-anywhere-proxy at least for WMF [22:49:21] but only aimed at setting host headers for what is functionally a loopback (may or may not load-balance but doesn't matter) [22:50:54] TODO: Document this in a brief and comprehensible way in a discoverable place (MWHttpReq class comment, ref linked from various other places?) [22:51:31] right, though here it doesn't actually loopback, it routes the request back to the api cluster (which means k8s is technically dependent on appservers(??)) [22:52:09] haha [22:52:28] I didn't realize that last part until now, I might have the localhost:6501 proxy in k8s route back to the mwdebug cluster [22:52:39] right [22:53:00] I guess it's difficult to choose between app, api_app and {,k8s} equiv. 
[22:53:16] maybe keeping it within the same cluster would be best yeah [22:53:30] although doing that consistently might be non-trivial with re-usable configs [22:53:48] which reminds me, I still am unable to make out heads or tails of how this is all wired together for k8s [22:54:03] * Krinkle glances at the calendar [22:54:12] the envoy config is still split and not shared between k8s and appservers, so it's trivial to have localhost:6501 go to different places [22:55:08] :P "this" probably encompasses a lot [22:56:04] (03PS4) 10RLazarus: mediawiki: Drop logrotate config for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713503 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [22:57:14] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30789/console" [puppet] - 10https://gerrit.wikimedia.org/r/713503 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [22:57:40] legoktm: ack, I was thinking more about between different k8s clusters. [22:57:57] 10SRE, 10SRE-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add dancy to `gerritadmin` LDAP group - https://phabricator.wikimedia.org/T289537 (10thcipriani) Approved as manager—needed for group access request management. [22:58:02] which isn't an issue yet today as I assume we have only mwdebug right now [22:58:24] we have 3 main clusters, staging, eqiad and codfw [22:58:34] different definition of cluster. [22:58:56] oh, normal appserver vs api vs job? [22:58:58] I meant cluster as in... pods of the same image for the same purpose. [22:59:23] whatever way we exlain to envoy to reach "this" k8s cluster might need to vary [22:59:38] unless there's a way to address "this" k8s cluster from the inside in an agnostic way [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Evening backport window. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210823T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:09] rather than having to specify the service name of the k8s load balancer or some such. [23:00:16] (03CR) 10RLazarus: [V: 03+1 C: 03+2] mediawiki: Drop logrotate config for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/713503 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [23:00:28] although alternatively if we loop it directly back to the exact same host, then we wouldn't need to [23:00:39] it depends on whether we want requests to load balance or not for this use case [23:00:58] in k8s terminology that would be "deployment" [23:01:32] right, I knew that from the Toolforge use of it. [23:01:40] either "deployment" or "service" yeah, but definitely not "cluster" :) [23:01:52] I believe right now service names and k8s deployments are a 1:1 mapping [23:03:01] I feel like routing it back to the same exact pod might be better and easier to reason about (same MW version, especially in a rolling update / canary context.). [23:04:13] that would be a behavior change from what we do right now [23:04:39] I'm not sure those really matter though, most of these requests are cross-wiki which means you can't assume any consistent versioning anyways [23:06:09] right [23:06:24] and it would make it harder to decouple mutliversion / move multiversion up a layer [23:06:38] cool [23:07:23] I do recall us having or having had one or two cases where we use wgConf to determine something as a local hostname and then make an actual direct localhost http call with Host set [23:08:48] https://gerrit.wikimedia.org/g/mediawiki/core/+/3c5ef86bc0d82056929ada2287b4208a69420459/includes/SiteConfiguration.php#559 [23:08:51] we also have that mess [23:09:27] >.< [23:09:57] .. 
where maybe the fact that it uses php serialize() isnt the worst part of it [23:11:53] I think we can probably remove support for that, and convert uses of it to regular get(). It only exists for cases where things aren't given to InnitialiseSettings and thus can't be found from other wikis in-process. [23:12:24] that seems like something we could throw a fatal for and say "try harder" [23:13:38] https://codesearch.wmcloud.org/search/?q=wgConf-%3EgetConfig%5C(&i=nope&files=&excludeFiles=&repos= [23:30:35] (03CR) 10Ryan Kemper: [C: 03+2] linkrecommendation: use CNAME for analytics-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/713934 (https://phabricator.wikimedia.org/T285355) (owner: 10Ryan Kemper) [23:32:13] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23), 10Patch-For-Review: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) >>! In T288848#7303412, @Legoktm wrote: > * Enable proxy in mwdebug k8s deployment too. Note that... [23:33:04] (03Merged) 10jenkins-bot: linkrecommendation: use CNAME for analytics-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/713934 (https://phabricator.wikimedia.org/T285355) (owner: 10Ryan Kemper) [23:40:45] !log ryankemper@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [23:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:14] !log T285355 `helmfile -e staging -i apply` on `/srv/deployment-charts/helmfile.d/services/linkrecommendation/` from `ryankemper@deploy1002` [23:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:17] T285355: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 [23:42:20] legoktm: should we capture "Use wgHTTPProxy at WMF" / "Remove extension-specific config vars" in a task? 
[23:42:29] Or do you want to tackle it as clean up within this one?