[00:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T0000). [00:00:05] Smalyshev and ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:20] I also have a patch https://gerrit.wikimedia.org/r/490326 [00:01:10] i can ship things [00:01:22] Zppix|mobile: will you be able to test from mobile? [00:01:32] * ebernhardson isn't sure how to test that, but trusts the +1's you have [00:01:34] I have my laptop in front of me ebernhardson [00:01:40] Zppix|mobile: alright perfect [00:01:57] Zppix|mobile: go ahead and add to the Deployments page please [00:02:15] ebernhardson ok [00:03:31] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493320 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [00:04:35] (03Merged) 10jenkins-bot: Add config for switching Wikibase search to WikibaseCirrusSearch codebase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493320 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [00:05:07] SMalyshev: I've pulled your patch to mwdebug1001, although afaict it's a noop outside of beta [00:05:27] ebernhardson: it's a noop everywhere actually [00:05:38] it just defines the variables, but doesn't do anything with them yet [00:05:46] ok excellent [00:06:02] there's a followup patch for it, but I'll do it tomorrow... 
I don't want to enable stuff and then go away [00:06:22] so if this one doesn't break the wiki, it's ok [00:06:30] sounds good, syncing [00:06:40] (03PS3) 10EBernhardson: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:06:47] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:07:18] !log ebernhardson@deploy1001 Synchronized wmf-config/: T215684 Add config for switching Wikibase search to WikibaseCirrusSearch codebase (duration: 00m 55s) [00:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:22] T215684: Deploy & test WikibaseCirrusSearch on beta cluster - https://phabricator.wikimedia.org/T215684 [00:07:22] (03CR) 10jenkins-bot: Add config for switching Wikibase search to WikibaseCirrusSearch codebase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493320 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [00:07:54] (03Merged) 10jenkins-bot: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:08:06] (03CR) 10jenkins-bot: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:08:26] Zppix|mobile: you're up on mwdebug1001 [00:08:34] Testing... 
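For context, "you're up on mwdebug1001" means the change is staged on a debug host and the tester should route their own requests through it before the config is synced fleet-wide. A minimal sketch of such a check, assuming the X-Wikimedia-Debug header format documented on Wikitech (the backend hostname comes from the log above; the block only prints the command rather than executing it):

```shell
# Sketch: verify a staged config change on a debug host before full sync.
# Assumes the documented "X-Wikimedia-Debug: backend=<host>" header format;
# hostname taken from the log above, URL chosen for illustration.
DEBUG_BACKEND='mwdebug1001.eqiad.wmnet'
URL='https://test.wikipedia.org/wiki/Special:Version'

# Build the curl invocation that pins the request to the debug backend
# (printed, not run, since this is only a sketch).
CMD="curl -s -H 'X-Wikimedia-Debug: backend=${DEBUG_BACKEND}' '${URL}'"
echo "$CMD"
```

A tester would run the printed command (or set the header via a browser extension) and confirm the page renders without errors before the deployer syncs.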
[00:09:12] No errors on my end good from here ebernhardson [00:10:34] !log ebernhardson@deploy1001 Synchronized wmf-config/CommonSettings.php: T215725 Remove mediawikiwiki from wgCentralAuthAutoCreateWikis (duration: 00m 54s) [00:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:41] Zppix|mobile: synced ^ [00:10:41] T215725: Consider removing mediawikiwiki from wgCentralAuthAutoCreateWikis - https://phabricator.wikimedia.org/T215725 [00:10:47] Thanks [00:19:43] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:33] !log ebernhardson@deploy1001 Synchronized vendor/: Remove scalar type hints from ruflin/Elastica (duration: 00m 58s) [00:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:40] SWAT is complete [00:36:37] should have checked closer, one more patch to finish off the last sync [00:47:19] PROBLEM - Device not healthy -SMART- on db2033 is CRITICAL: cluster=mysql device=cciss,11 instance=db2033:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2033&var-datasource=codfw+prometheus/ops [00:49:14] (03PS1) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 [00:50:07] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (owner: 10CRusnov) [00:50:59] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:55:20] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.19/vendor/: vendor/ruflin/Elastica: Remove scalar return type hints (duration: 01m 33s) [00:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:36] (03PS2) 10CRusnov: Add configuration for the ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493348 [00:59:06] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (owner: 10CRusnov) [00:59:18] (03PS3) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [01:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T0100). [01:00:41] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [01:03:41] !log preparing to deploy phabricator-2019-02-27 [01:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:45] (03PS1) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 [01:10:37] (03CR) 10jerkins-bot: [V: 04-1] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (owner: 10CRusnov) [01:10:53] (03PS2) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) [01:11:05] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is CRITICAL: 133.9 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [01:11:41] (03CR) 10jerkins-bot: [V: 04-1] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [01:13:19] (03PS3) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 
10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) [01:16:19] (03PS4) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [01:19:22] RECOVERY - nova-compute proc maximum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [01:20:41] !log deploying phabricator update 2019-02-27 [01:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:45] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) I wonder if --target-max-inflight-requests... [01:25:34] PROBLEM - nova-compute proc maximum on cloudvirt1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [01:43:03] !log phabricator upgrade completed without issues (actually completed at 01:23 UTC but I failed to hit enter and submit this message) [01:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:21] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) a:03aaron [02:04:28] * chasemp waves to twentyafterfour - thanks for still keeping it rolling after all this time :) [02:08:48] !log clouddb1002 is now in place to replace labsdb1004 as replica for toolsdb but not wikilabels postgres yet T193264 [02:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:51] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - 
https://phabricator.wikimedia.org/T193264 [02:18:26] RECOVERY - nova-compute proc maximum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [02:24:36] PROBLEM - nova-compute proc maximum on cloudvirt1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [02:26:05] That host says 'notifications for this host have been disabled' [02:26:06] and yet... [02:48:33] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10kaldari) @MoritzMuehlenhoff - Now that the Thumbor hosts are upgraded to Debian Stretch, and Cargo has been made available in Stretch, are there any remaining blockers to upgrading li... [02:54:05] RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 28.41 seconds [03:00:04] kart_: I, the Bot under the Fountain, allow thee, The Deployer, to do deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T0300). 
[03:03:51] !log Manual run of unpublished ContentTranslation draft purge script (T216983) [03:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:55] T216983: Run unpublished draft purge script for CX (Week of 24/02) - https://phabricator.wikimedia.org/T216983 [05:49:57] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T217301 (10Marostegui) p:05Triage→03Normal a:03Papaul @Papaul let's get the disk replaced. Thank you [05:56:04] !log Upgrade MySQL on db1124 (Sanitarium) lag will be generated on s1,s3,s5,s8 [05:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:29] (03PS3) 10Elukey: Add labsdb1012 basic puppet settings [puppet] - 10https://gerrit.wikimedia.org/r/493299 (https://phabricator.wikimedia.org/T215231) [05:59:42] marostegui: ---^ \o/ [06:00:29] (03CR) 10Marostegui: [C: 03+1] Add labsdb1012 basic puppet settings [puppet] - 10https://gerrit.wikimedia.org/r/493299 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [06:00:52] (03CR) 10Elukey: [C: 03+2] Add labsdb1012 basic puppet settings [puppet] - 10https://gerrit.wikimedia.org/r/493299 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [06:01:34] marostegui: It seems my script is taking more time than scheduled. It should finish in 15 more minutes, but let me know if anything is wrong. 
Running on mwmaint1002 (https://phabricator.wikimedia.org/T216983) [06:02:30] kart_: I had no idea there was that script running :) [06:02:56] marostegui: just in case someone is wondering :) [06:03:14] :) [06:13:27] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) [06:18:09] !log Finished manual run of unpublished ContentTranslation draft purge script (T216983) [06:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:13] T216983: Run unpublished draft purge script for CX (Week of 24/02) - https://phabricator.wikimedia.org/T216983 [06:18:46] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10Mholloway) A few more notes, for posterity: After digging some mor... [06:18:57] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 [06:21:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 (owner: 10Marostegui) [06:22:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 (owner: 10Marostegui) [06:23:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 (duration: 00m 55s) [06:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:28] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2033 is CRITICAL: cluster=mysql device=cciss,11 instance=db2033:9100 job=node site=codfw Marostegui T217301 - The acknowledgement expires at: 2019-03-08 06:24:09. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2033&var-datasource=codfw+prometheus/ops [06:27:19] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [06:29:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 (owner: 10Marostegui) [06:31:21] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/nginx] [06:45:39] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) @Cmjohnson DHCP works fine and I can PXE boot, but then the Debian installer complains about "no partition found".. I checked via the... [06:46:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The content of the files is correct, but I don't think the code belongs in the profile but rather the module as there is nothing wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [06:49:20] RECOVERY - nova-compute proc minimum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [06:50:00] !log Deploy schema change on db1079, this will generate lag on s7 on labs - T86342 [06:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:04] T86342: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 [06:57:27] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:02:23] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Gilles) Yes, you can go ahead 
and switch Thumbor to Mcrouter. Thumbor uses memcached for non-critical... [07:08:04] !log Stop MySQL on db1079 for mysql upgrade [07:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:40] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) My 2 cents: * from [[ https://grafana.wi... [07:21:00] (03PS1) 10Marostegui: db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 [07:22:04] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 (owner: 10Marostegui) [07:23:08] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 (owner: 10Marostegui) [07:24:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1079 after mysql upgrade (duration: 00m 56s) [07:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:20] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 (owner: 10Marostegui) [07:30:16] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) There is one thing that I don't get from... 
[07:41:26] (03PS1) 10Marostegui: db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 [07:44:53] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 (owner: 10Marostegui) [07:46:17] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 (owner: 10Marostegui) [07:47:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1079 in API after mysql upgrade (duration: 00m 53s) [07:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] (03CR) 10jenkins-bot: db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 (owner: 10Marostegui) [08:04:11] (03PS2) 10Elukey: hadoop analytics: move Yarn rmstore from zookeeper to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/493164 (https://phabricator.wikimedia.org/T216952) [08:04:57] (03CR) 10Elukey: [C: 03+2] hadoop analytics: move Yarn rmstore from zookeeper to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/493164 (https://phabricator.wikimedia.org/T216952) (owner: 10Elukey) [08:06:47] !log installing glibc security updates for stretch [08:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:27] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 [08:09:27] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 (owner: 10Marostegui) [08:10:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 (owner: 10Marostegui) [08:12:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1079 after mysql upgrade (duration: 00m 
54s) [08:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:44] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 (owner: 10Marostegui) [08:16:50] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 [08:22:20] !log Change abuse_filter_log indexes on s3 codfw, lag will appear on codfw - T187295 [08:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:23] T187295: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 [08:28:31] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10MoritzMuehlenhoff) @kaldari Effie looked into that and my initial estimation that we could build librsvg 2.42 in Stretch didn't hold, it needs more recent versions of Rust and Cargo t... [08:29:17] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 (owner: 10Marostegui) [08:30:22] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 (owner: 10Marostegui) [08:31:37] !log roll restart of Yarn Resource Managers on an-master100[1,2] to pick up new settings [08:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase API traffic db1079 after mysql upgrade (duration: 00m 53s) [08:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:49] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 (owner: 10Marostegui) [08:38:36] (03CR) 10DCausse: "all plugins except hebrew and ltr have been officially 
released." (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) (owner: 10DCausse) [08:52:24] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 [08:57:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 (owner: 10Marostegui) [08:58:12] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 (owner: 10Marostegui) [08:59:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1079 (duration: 00m 53s) [08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:45] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 (owner: 10Marostegui) [09:02:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 [09:06:08] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 (owner: 10Marostegui) [09:07:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 (owner: 10Marostegui) [09:08:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 (duration: 00m 53s) [09:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:25] !log Stop MySQL on db1121 for upgrade, this will generate lag on labsdb:s4 [09:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:19] (03PS5) 10DCausse: Plugins for elasticsearch 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) [09:12:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Add federation 
configs for commonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493339 (https://phabricator.wikimedia.org/T217285) (owner: 10Ladsgroup) [09:14:35] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 (owner: 10Marostegui) [09:17:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 [09:19:00] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 (owner: 10Marostegui) [09:19:57] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 (owner: 10Marostegui) [09:21:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 00m 54s) [09:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:27] (03PS8) 10Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 [09:22:31] !log Stop MySQL on db1125 (sanitarium) to upgrade, this will generate lag on labs on: s2, s4, s6,s7 [09:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Move hhvm-fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [09:26:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 (owner: 10Marostegui) [09:26:33] !log installed php security updates on netmon1002 and people1001 [09:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:43] (03PS14) 10Giuseppe Lavagetto: mediawiki: Move hhvm-fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 
10Krinkle) [09:30:42] !log start cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952 [09:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:46] T216952: Hadoop Yarn stores a ton of znodes related to running/old applications - https://phabricator.wikimedia.org/T216952 [09:34:04] (03CR) 10Filippo Giunchedi: "LGTM, let us know when safe to deploy" [puppet] - 10https://gerrit.wikimedia.org/r/493323 (https://phabricator.wikimedia.org/T217162) (owner: 10Anomie) [09:37:22] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/NavigationTiming/modules/ext.navigationTiming.js: T217210 Don't assume PerformanceObserver entry types are supported (duration: 00m 54s) [09:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:25] T217210: Nav timing throws exception on Safari "TypeError: entryTypes contained only unsupported types" - https://phabricator.wikimedia.org/T217210 [09:43:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Move hhvm-fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [09:44:45] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/14918/" [puppet] - 10https://gerrit.wikimedia.org/r/493232 (https://phabricator.wikimedia.org/T86969) (owner: 10Filippo Giunchedi) [09:44:53] (03PS2) 10Filippo Giunchedi: deployment_server: ship logs through logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/493232 (https://phabricator.wikimedia.org/T86969) [09:45:20] <_joe_> godog: you'll have to wait for puppet-merging [09:45:39] <_joe_> i need to disable puppet and coordinate runs wherever hhvm is installed [09:45:43] <_joe_> thanks, puppet [09:46:47] _joe_: ok, no problem, thanks for the heads up [09:47:13] <_joe_> I'll unblock the deployment servers in a minute [09:48:53] kk, let me know when done [09:50:23] <_joe_> {{done}} 
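The db1079 commits logged above follow a fixed maintenance pattern: depool, upgrade MySQL, repool at low weight, raise the weight in steps (general and API groups separately), then restore full weight. Each step is just a weight edit in wmf-config/db-eqiad.php. A hypothetical fragment illustrating one intermediate step (db1079 and the s7 section come from the log; the array shape, the other host names, and all weights are assumptions for the sketch, not the real file contents):

```php
<?php
// Illustrative only: one staged-repool step for db1079 in s7, sketched
// against an assumed db-eqiad.php layout. Host names other than db1079
// and all weight values are hypothetical.
$sectionLoads['s7'] = [
    'db1062' => 0,   // hypothetical master: takes no general read load
    'db1079' => 50,  // repooled at low weight after the MySQL upgrade
    'db1086' => 300, // hypothetical untouched replica at full weight
];
// Follow-up commits raise db1079 step by step (e.g. 50 -> 150 -> 300),
// each one synced with scap and watched for lag before the next bump.
```

The point of the staging is that a freshly restarted replica has a cold buffer pool; ramping the weight lets its caches warm up without sending it a full share of production reads at once.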
[09:51:19] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:51:46] <_joe_> that's me, disregard [09:56:31] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:00:26] <_joe_> !log executing a rolling puppet run (2 servers at a time per cluster, per dc) in eqiad,codfw as an HHVM restart will be triggered [10:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:30] (03PS1) 10Zoranzoki21: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) [10:05:51] (03PS1) 10ArielGlenn: update dashboard name so links to prometheus metric graphs work [software/tendril] - 10https://gerrit.wikimedia.org/r/493376 [10:07:05] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) [10:07:40] (03PS2) 10Zoranzoki21: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) [10:07:46] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:08:30] (03CR) 10jerkins-bot: [V: 04-1] Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [10:09:04] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) [10:10:21] (03CR) 10Zoranzoki21: "Scheduled for today Morning SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [10:14:07] 
10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [10:14:12] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) 05Open→03Stalled [10:17:53] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: HTTP CRITICAL - No data received from host [10:18:27] mmhm etherpad down [10:18:44] it was updated recently [10:19:07] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8961 bytes in 0.082 second response time [10:19:15] jijiki: how recently? [10:19:23] volans: I think last week [10:20:11] it is back anyway, wikimedia runs on coffee and etherpad :p [10:38:53] (03PS1) 10Urbanecm: New throttle rule for Czech Wikigap 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) [10:46:03] (03PS1) 10Urbanecm: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) [10:47:40] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:48:32] (03CR) 10Elukey: "I have a couple of questions:" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [10:52:19] PROBLEM - HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:53:13] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 76125 bytes in 0.463 second response time [10:53:31] RECOVERY - mediawiki-installation DSH group on mw1272 is OK: OK [11:00:40] (03PS1) 10Ammarpad: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) [11:07:22] (03PS2) 10Ammarpad: Set default aliases 
for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) [11:11:11] (03CR) 10Addshore: [C: 04-2] "Federation is not and should not currently be turned on on commonswiki (yet)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493339 (https://phabricator.wikimedia.org/T217285) (owner: 10Ladsgroup) [11:12:09] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is OK: (C)130 ge (W)110 ge 105.5 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [11:28:33] !log pause cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952 [11:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] T216952: Hadoop Yarn stores a ton of znodes related to running/old applications - https://phabricator.wikimedia.org/T216952 [11:30:59] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:19] PROBLEM - Host sca1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:27] PROBLEM - Host sca2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:35] expected ^ [11:31:38] actually [11:31:43] Let's celebrate! [11:32:01] LOL [11:32:05] !log remove sca1003, sca1004, sca2003, sca2004 from the fleet. Celebrate!!!! 
[11:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] PROBLEM - Host sca2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:32:45] puppet should run soon on icinga host so these are going away [11:35:38] <_joe_> akosiaris: \o/ [11:37:16] \o/ [11:53:13] (03PS4) 10WMDE-Fisch: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) [11:53:40] (03PS1) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to access the admin port [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) [11:55:41] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:37] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 76098 bytes in 0.268 second response time [11:59:43] !log rolling openssl security updates to jessie systems [11:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1200). [12:00:04] CFisch_WMDE and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:11] Here [12:00:13] \o/ [12:00:49] o/ [12:00:57] CFisch_WMDE: are you a deployer? [12:01:06] if not, why not!? ;) [12:01:34] Yes, but again in a meeting in parallel so not really able to do it myself -.- [12:01:49] no problemo, in that case [12:01:54] :-) [12:01:54] I can SWAT today! [12:03:20] CFisch_WMDE: I'll ping you when your patch is ready for testing at mwdebug [12:03:29] Cool! 
[12:05:57] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [12:07:07] (03Merged) 10jenkins-bot: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [12:07:17] (03Abandoned) 10Ladsgroup: Add federation configs for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493339 (https://phabricator.wikimedia.org/T217285) (owner: 10Ladsgroup) [12:07:52] CFisch_WMDE: it's at mwdebug1002, please test and let me know if I can deploy it [12:08:01] Urbanecm: please stand by, you're next :) [12:08:07] ok [12:08:14] only throttles this time, so... :) [12:08:28] Urbanecm: ah, in that case I'll just ping you when I'm done :) [12:08:32] good [12:08:51] * CFisch_WMDE testing [12:11:32] zeljkof: hmm it's not really working [12:11:43] mwdebug1002 is correct? [12:12:06] CFisch_WMDE: yes, mwdebug1002 [12:13:01] CFisch_WMDE: revert? [12:13:09] No give me a sec. [12:13:25] ok [12:15:04] * addshore reads up [12:15:19] So there's no unintended behavior visible anywhere. Maybe it has to do with some strange caching. The change in the config is quite minor; we might be good just deploying it. [12:16:17] addshore: Normally you should see the reference previews as beta feature on test.wikipedia.org [12:16:31] CFisch_WMDE: so, I should deploy? or wait?
[12:16:32] (03CR) 10jenkins-bot: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [12:16:36] And it should be non-beta-featurish on the beta clusters [12:16:45] zeljkof: please deploy it [12:16:52] CFisch_WMDE: ok, I can revert after the deploy, if there are problems [12:16:55] deploying [12:18:12] !log zfilipin@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:491959|Show referencePreviews on group0 wikis as beta feature (T214905)]] (duration: 00m 56s) [12:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:15] T214905: Show referencePreviews on group0 wikis - https://phabricator.wikimedia.org/T214905 [12:18:36] CFisch_WMDE: it's deployed, please test and let me know if I should revert, in case of trouble [12:19:34] zeljkof: Cool, thanks I will check again. [12:19:51] CFisch_WMDE: I'm seeing this in the logs: MWContentSerializationException from line 155 of /srv/mediawiki/php-1.33.0-wmf.18/extensions/Wikibase/lib/includes/Store/EntityContentDataCodec.php: Content too big! Entity: Q27972199 [12:20:12] zeljkof: that is unrelated [12:20:18] zeljkof: there is a ticket for that somewhere [12:20:24] ok, cool, did not notice it before [12:20:47] On beta I had some random errors with settings include something when loading the page [12:20:59] it's T215380 [12:21:00] T215380: Content too big!
Entity: Q27972199 - https://phabricator.wikimedia.org/T215380 [12:21:04] ( but that was already before ) [12:25:14] (03PS1) 10Muehlenhoff: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 [12:26:02] (03CR) 10jerkins-bot: [V: 04-1] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff) [12:26:40] zeljkof: I'll try to get that ticket done in the coming week! [12:27:02] addshore: please do, it's at the top of fatal-monitor and mediawiki-errors! o.O [12:27:05] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "Since this is allowing from a handful of servers it is not such horrible idea. I suggest if we are happy with how this goes, we can add so" [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [12:27:20] even more hits than T204871 [12:27:20] T204871: Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments - https://phabricator.wikimedia.org/T204871 [12:27:34] zeljkof: indeed, it is essentially just a bot trying to edit the entity, and each time they retry the edit it exceptions again [12:27:44] bad bot [12:28:01] well, bad wikibase for bubbling the exception all the way up i guess :p [12:28:08] / for the page size limit [12:28:11] that too ;P [12:28:26] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) (owner: 10Urbanecm) [12:29:28] (03Merged) 10jenkins-bot: New throttle rule for Czech Wikigap 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) (owner: 10Urbanecm) [12:30:14] (03PS2) 10Muehlenhoff: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 [12:30:56] (03CR)
10jerkins-bot: [V: 04-1] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff) [12:31:30] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493382|New throttle rule for Czech Wikigap 2019 (T217270)]] (duration: 00m 53s) [12:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:35] T217270: Add throttle rule for Czech Wikigap 2019 - https://phabricator.wikimedia.org/T217270 [12:32:59] (03PS2) 10Zfilipin: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:33:09] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:33:46] On the beta cluster I sometimes get this error Fatal error: Cannot redeclare wmfLabsSettings() (previously declared in /srv/mediawiki/wmf-config/InitialiseSettings-labs.php:81) in /srv/mediawiki/wmf-config/InitialiseSettings-labs.php on line 81 [12:33:46] (172.16.4.119) [12:34:07] (03Merged) 10jenkins-bot: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:34:23] CFisch_WMDE: revert? 
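The throttle patches synced above (T217270, T217155) follow the usual wmf-config pattern: a temporary account-creation throttle exception keyed by wiki, IP range, and event time window, deployed via wmf-config/throttle.php. A minimal Python sketch of how such an exception list might be evaluated — the entry shape mirrors throttle.php, but the names, range, and limits here are illustrative, not the real rules:

```python
from datetime import datetime, timezone
from ipaddress import ip_address, ip_network

# Hypothetical throttle exceptions, mirroring the shape of entries in
# wmf-config/throttle.php (dbname, IP range, time window, raised limit).
THROTTLE_EXCEPTIONS = [
    {
        "dbname": "cswiki",           # e.g. Czech Wikigap 2019 (T217270)
        "range": "198.51.100.0/24",   # documentation range, not the real one
        "from": datetime(2019, 3, 8, tzinfo=timezone.utc),
        "to": datetime(2019, 3, 9, tzinfo=timezone.utc),
        "value": 40,                  # accounts allowed instead of the default
    },
]

DEFAULT_LIMIT = 6  # illustrative default account creations per IP per day

def creation_limit(dbname, ip, now):
    """Return the account-creation limit that applies to this request."""
    for rule in THROTTLE_EXCEPTIONS:
        if (rule["dbname"] == dbname
                and rule["from"] <= now < rule["to"]
                and ip_address(ip) in ip_network(rule["range"])):
            return rule["value"]
    return DEFAULT_LIMIT
```

During the event window an IP inside the listed range gets the raised limit; everyone else, and everyone outside the window, keeps the default — which is why these rules can be merged ahead of time and simply expire on their own.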
[12:34:40] nope I got this before already [12:35:03] ah and now at least one part of the patch seems to work [12:35:46] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493383|Add throttle rule for Day of Digital Service (T217155)]] (duration: 00m 52s) [12:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:52] T217155: Requesting temporary lift of IP cap - https://phabricator.wikimedia.org/T217155 [12:35:53] Urbanecm: all deployed [12:35:59] thx [12:36:10] CFisch_WMDE: I'm around in case a revert is needed [12:36:15] !log EU SWAT finished [12:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:28] CFisch_WMDE, Urbanecm: thank you for deploying with #releng ;) [12:36:33] yw [12:36:56] zeljkof: addshore Could it be that there's an additional weird caching / propagation for new beta features? [12:37:32] CFisch_WMDE: I don't know how those work :/ [12:39:34] (03CR) 10jenkins-bot: New throttle rule for Czech Wikigap 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) (owner: 10Urbanecm) [12:39:36] (03CR) 10jenkins-bot: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:41:06] Ok I'll just wait and see .... the part of the patch affecting the beta-cluster is working, maybe the rest will fix itself later. I'm grabbing lunch now. [12:44:44] CFisch_lunch: i cant think of any such cache :/ [12:44:51] and i checked the settings and they look correct [12:45:00] CFisch_lunch: is the expectation that it now appears as a BF? [12:45:04] or was it doing that before? [12:46:05] * zeljkof 's finger is hovering over big red REVERT button ;P [12:48:28] (03CR) 10Muehlenhoff: [C: 03+1] "Sound fine to me, anyone on the deployment hosts can easily do greater damage than what the admin port grants them." 
[puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [12:54:40] CFisch_lunch: zeljkof is it in wgBetaFeaturesWhitelist ? [12:55:43] addshore: sorry, what? [12:55:44] :) [12:57:50] I think the BF hasn't been added to wgBetaFeaturesWhitelist, which is why it won't show :) [12:58:14] ah, I really don't know how that works [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1300) [13:18:08] (03PS2) 10Giuseppe Lavagetto: Scap: upgrade cloud VPS to 3.9.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: 10Thcipriani) [13:18:10] (03PS1) 10Giuseppe Lavagetto: scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) [13:18:48] (03PS3) 10Ammarpad: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) [13:26:38] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) Thanks @Gilles ! I have tried again a 9000h migration and that worked, I've now merged the migrated storage with existing storage and wi...
[13:32:36] (03PS1) 10Joal: Update sqoop timers templates to follow new CLI [puppet] - 10https://gerrit.wikimedia.org/r/493406 (https://phabricator.wikimedia.org/T215290) [13:34:51] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2003.codfw.wmnet [13:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:54] (03PS1) 10Joal: Add ipblocks_restrictions table to labs sqoop list [puppet] - 10https://gerrit.wikimedia.org/r/493407 (https://phabricator.wikimedia.org/T209549) [13:43:37] !log depool prometheus1003.eqiad.wmnet to take a data snapshot [13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:43] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [13:46:49] (03PS2) 10Effie Mouzeli: scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [13:47:09] (03CR) 10Effie Mouzeli: [C: 03+1] scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [13:47:11] (03PS1) 10WMDE-Fisch: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) [13:47:27] (03PS2) 10WMDE-Fisch: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) [13:48:55] zeljkof: We know now, why it's not working -.- [13:49:00] we forgot this: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493409/ [13:50:38] (03CR) 10Marostegui: [V: 03+2 C: 03+2] update dashboard name so links to prometheus metric graphs work [software/tendril] - 
10https://gerrit.wikimedia.org/r/493376 (owner: 10ArielGlenn) [13:51:52] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [13:52:32] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus1003.eqiad.wmnet [13:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:01] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [13:55:52] (03Abandoned) 10Joal: Update sqoop timers templates to follow new CLI [puppet] - 10https://gerrit.wikimedia.org/r/493406 (https://phabricator.wikimedia.org/T215290) (owner: 10Joal) [13:56:08] (03Abandoned) 10Joal: Add ipblocks_restrictions table to labs sqoop list [puppet] - 10https://gerrit.wikimedia.org/r/493407 (https://phabricator.wikimedia.org/T209549) (owner: 10Joal) [13:56:47] !log re-start cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952 [13:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:50] T216952: Hadoop Yarn stores a ton of znodes related to running/old applications - https://phabricator.wikimedia.org/T216952 [14:00:05] hashar: That opportune time is upon us again. Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1400). 
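The missing piece addshore and CFisch_WMDE tracked down above is that BetaFeatures only offers a registered feature once it is listed in the wiki's whitelist (wgBetaFeaturesWhitelist); registering the preference alone is not enough, hence the follow-up patch 493409. A toy Python model of that gating — the feature key and the gating function here are illustrative, not the extension's actual code:

```python
# Hypothetical model of BetaFeatures gating: a feature registered by an
# extension only appears on Special:Preferences if the wiki's whitelist
# either is unset (everything allowed) or contains the feature's key.
def visible_beta_features(registered, whitelist):
    if whitelist is None:          # no whitelist configured: show everything
        return list(registered)
    return [f for f in registered if f in whitelist]

registered = ["reference-previews", "some-other-feature"]

# Before the fix: ReferencePreviews is registered but absent from the
# whitelist, so it never shows up as a beta feature.
before = visible_beta_features(registered, ["some-other-feature"])

# After the fix: the key is whitelisted and the beta feature appears.
after = visible_beta_features(
    registered, ["some-other-feature", "reference-previews"])
```

This also explains why no amount of cache-poking on mwdebug1002 made the feature appear: the config simply filtered it out.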
[14:01:40] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [14:01:46] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [14:04:41] (03PS1) 10Elukey: role::analytics_cluster::coordinator: deploy common analytics repos [puppet] - 10https://gerrit.wikimedia.org/r/493411 [14:04:45] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [14:05:41] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: deploy common analytics repos [puppet] - 10https://gerrit.wikimedia.org/r/493411 (owner: 10Elukey) [14:06:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Please merge by giving +2 at your convenience." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/493196 (owner: 10Alexandros Kosiaris) [14:08:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882 (owner: 10Muehlenhoff) [14:11:30] CFisch_remote: add it to the next swat? is it urgent? 
[14:11:42] (03PS1) 10Paladox: LocalUsernamesToLowerCase: Bind disabled GitReferenceUpdated instance [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/493412 [14:11:52] hashar is train conductor this week, maybe he can deploy it outside of swat [14:12:00] zeljkof: yeah would be cool if it goes out today [14:12:56] I mean there's also a break later in the calendar - so maybe its also doable there [14:13:34] hashar would know, it's train now [14:13:39] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:39] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:39] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:40] kk [14:13:43] thx [14:13:45] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:51] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:55] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:14:37] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:16:07] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [14:16:07] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [14:16:07] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:16:15] RECOVERY - DPKG on notebook1003 is OK: All packages OK [14:16:23] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [14:17:19] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
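The notebook1003 PROBLEM/RECOVERY burst above is the usual pattern when a host reboots while its NRPE checks are still scheduled. Icinga-style monitoring damps this with soft vs. hard states: a check only notifies after several consecutive failures, and a recovery only after a hard problem clears. A toy Python model of that idea — the retry count and event names are hypothetical, not Icinga's actual implementation:

```python
# Toy soft/hard state machine in the spirit of Icinga's max_check_attempts:
# emit PROBLEM only after `attempts` consecutive failures, and emit RECOVERY
# as soon as a hard problem sees a passing check again.
def notifications(results, attempts=3):
    events = []
    streak = 0          # consecutive failures so far (soft state)
    hard_down = False   # whether we are in a notified, hard PROBLEM state
    for ok in results:
        if ok:
            if hard_down:
                events.append("RECOVERY")
            streak, hard_down = 0, False
        else:
            streak += 1
            if streak >= attempts and not hard_down:
                events.append("PROBLEM")
                hard_down = True
    return events
```

With attempts=3, a two-poll blip during a reboot produces no notifications at all, while a sustained outage produces exactly one PROBLEM and one RECOVERY.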
[14:17:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php::monitoring: allow deployment hosts to access the admin port [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [14:18:01] (03PS2) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to access the admin port [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) [14:19:51] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:24:32] !log milimetric@deploy1001 Started deploy [analytics/refinery@f605fad]: New sqoop logic that uses the sharded replicas [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:41] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:27:39] bla bla [14:27:46] going to do the group1 upgrade [14:28:22] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/WikibaseMediaInfo: Move up checks to test if we should construct depicts widgets - T217285 (duration: 00m 58s) [14:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:26] T217285: commonswiki / wikibase: Postcondition failed: Namespace for entity type property must be defined! 
- https://phabricator.wikimedia.org/T217285 [14:28:55] (03PS1) 10Hashar: group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 [14:28:57] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 (owner: 10Hashar) [14:30:39] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:30:41] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:30:47] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 (owner: 10Hashar) [14:30:50] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging] [14:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:52] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed [14:30:52] !log akosiaris@deploy1001 scap-helm citoid finished [14:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:27] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:31:27] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:32:01] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:33:29] hmmm kubernetes etcd is not having a good time in eqiad [14:34:32] !log milimetric@deploy1001 Finished deploy 
[analytics/refinery@f605fad]: New sqoop logic that uses the sharded replicas (duration: 10m 00s) [14:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] RECOVERY - Disk space on notebook1003 is OK: DISK OK [14:35:09] hm latencies are dropping again [14:36:05] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 (owner: 10Hashar) [14:36:06] this isn't action related though [14:36:26] as the staging cluster who uses a different etcd cluster did not suffer from that [14:36:32] s/did not/did [14:36:42] so both cluster suffered [14:37:29] no create/delete actions but compareAndSwap, get and list all had skyrocketing latencies [14:38:01] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:38:11] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:38:51] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:38:51] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:39:17] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:39:22] nothing though on etcd logs [14:39:42] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) The core php72 packages had the extensions rebuilt for 7.2 were all rebuilt (in the correct ordering) within the... 
[14:40:13] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19 [14:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:21] ah no, finally found it [14:40:22] (03CR) 10Thcipriani: [C: 03+1] "Nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [14:40:23] Feb 28 14:32:37 etcd1003 etcd[453]: the connection to peer 52897700ee9bcb73 is unhealthy [14:40:23] Feb 28 14:32:37 etcd1003 etcd[453]: the connection to peer 460d53f044bf905e is unhealthy [14:40:27] ok, networking issue it seems [14:40:41] [Exception MWException] (/srv/mediawiki/php-1.33.0-wmf.19/includes/cache/localisation/LocalisationCache.php:475) No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [14:40:43] grmblblblb [14:41:08] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.19 (duration: 00m 53s) [14:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:03] (03PS1) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to reach the opcode endpoint [puppet] - 10https://gerrit.wikimedia.org/r/493418 (https://phabricator.wikimedia.org/T211964) [14:42:58] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=rhodium.eqiad.wmnet [14:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:10] Feb 28 14:31:35 ganeti1008 kernel: [9245722.385430] block drbd2: Remote failed to finish a request within 43604ms > ko-count (7) * timeout (60 * 0.1s) [14:43:21] something networkingy looks like it [14:43:51] (03PS2) 10Elukey: Find db host and port using refinery [puppet] - 10https://gerrit.wikimedia.org/r/493331 (https://phabricator.wikimedia.org/T215290) (owner: 10Milimetric) [14:43:57] all drbd resources terminated and resumed [14:44:42] I am guessing something to do with ganeti1008 being re-racked (I am emptying 
it now) [14:45:46] so somehow wmf.19 lost the l10n cache for english language [14:46:02] well at least for some code paths / wikis / server [14:46:12] (03CR) 10Elukey: [C: 03+2] Find db host and port using refinery [puppet] - 10https://gerrit.wikimedia.org/r/493331 (https://phabricator.wikimedia.org/T215290) (owner: 10Milimetric) [14:46:24] oh [14:46:27] that is just on mw1272 [14:46:28] !log hashar@deploy1001 scap sync-l10n completed (1.33.0-wmf.19) (duration: 03m 33s) [14:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:44] !log mw1272 had /srv/mediawiki/php-1.33.0-wmf.19/includes/cache/localisation/LocalisationCache.php:475) No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [14:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:25] !log mw1272 fixed by running "scap sync-l10n" from deploy host [14:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14921/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/493418 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [14:50:01] !log reboot cloudnet2001-dev.codfw.wmnet [14:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:12] (03PS2) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to reach the opcode endpoint [puppet] - 10https://gerrit.wikimedia.org/r/493418 (https://phabricator.wikimedia.org/T211964) [14:50:40] <_joe_> hashar: I think mw1272 was down/in maintenance for some time [14:52:37] PROBLEM - Host cloudnet2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [14:53:01] _joe_: yeah I am assuming that as well.
anyway syncing l10n did bring it up to date ;) [14:53:49] (03CR) 10Nikerabbit: [C: 04-1] Enable edittag for ExternalGuidance in CX and VE (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [14:54:59] !sal [14:54:59] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [14:55:47] RECOVERY - Host cloudnet2001-dev is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:59:37] (03PS1) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493421 [15:01:37] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:50] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (10Mathew.onipe) [15:02:24] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (10Mathew.onipe) a:03Mathew.onipe [15:02:48] (03PS2) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493421 (https://phabricator.wikimedia.org/T216993) [15:10:50] (03PS2) 10Herron: add interface::add_ip6_mapped to default production node definition [puppet] - 10https://gerrit.wikimedia.org/r/480537 (https://phabricator.wikimedia.org/T102099) [15:13:37] (03CR) 10Herron: [C: 03+2] add interface::add_ip6_mapped to default production node definition [puppet] - 10https://gerrit.wikimedia.org/r/480537 (https://phabricator.wikimedia.org/T102099) (owner: 10Herron) [15:14:00] (03PS1) 10Jbond: Offline rhodium.eqiad.wmnet so it can be rebooted [puppet] - 10https://gerrit.wikimedia.org/r/493422 (https://phabricator.wikimedia.org/T216802) [15:14:24] (03PS2) 10Jbond: Offline 
rhodium.eqiad.wmnet so it can be rebooted [puppet] - 10https://gerrit.wikimedia.org/r/493422 (https://phabricator.wikimedia.org/T216802) [15:14:41] <_joe_> !log uploading scap 3.9.1-1 to {stretch,jessie}-wikimedia [15:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:42] !log powering off db1114 to replace motherboard T214720 [15:15:44] (03CR) 10Jbond: [C: 03+2] Offline rhodium.eqiad.wmnet so it can be rebooted [puppet] - 10https://gerrit.wikimedia.org/r/493422 (https://phabricator.wikimedia.org/T216802) (owner: 10Jbond) [15:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:46] T214720: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 [15:16:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [15:17:10] <_joe_> thcipriani: ^^ I'm merging the patches that will bring scap 3.9.1 to beta [15:17:28] <_joe_> then we can test it I guess in production [15:17:49] <_joe_> (I love when I can say "we can test it in production") [15:17:59] (03PS3) 10Giuseppe Lavagetto: scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) [15:18:45] _joe_: sure, I'm still finishing morning-routine pre-meeting stuff, but beta already, actually, overrides scap::version (managed via cumin there) so that patch is fine. It'll affect other labs projects that have scap installed for whatever reason. 
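The scap packaging change discussed above (493404) leans on the usual Hiera pattern: Puppet stops pinning a package version by default, and an environment like beta can still pin one by overriding scap::version at a more specific hierarchy level — exactly why thcipriani notes beta "already overrides" it. A small Python sketch of that first-match lookup; the layer names and values are hypothetical, not the real hierarchy:

```python
# Hypothetical Hiera-style lookup: the first layer that defines the key wins,
# so a project-level override (as beta does for scap::version) beats the
# fleet-wide default.
HIERARCHY = [
    {"name": "project:deployment-prep", "data": {"scap::version": "3.9.1-1"}},
    {"name": "common",                  "data": {"scap::version": "present"}},
]

def lookup(key, hierarchy):
    for layer in hierarchy:
        if key in layer["data"]:
            return layer["data"][key]
    raise KeyError(key)

# Beta pins an explicit version; a host without the override just gets
# "present", i.e. "make sure the package is installed, let apt pick the
# version" — which is what "do not manage package versions by default" means.
```

The practical upside is the one _joe_ and thcipriani exercise next: new scap releases can be rolled out per-environment without touching the fleet-wide Puppet default.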
[15:19:07] (03PS2) 10Muehlenhoff: confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882
[15:19:13] !log rebooting rhodium
[15:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:16] _joe_: I'll be more around in 15 if I'm needed for testing
[15:19:24] <_joe_> thcipriani: take your time
[15:19:36] <_joe_> we can test things only quite some time later
[15:21:15] 10Operations, 10ExternalGuidance, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Thanks, @santhosh. When you say "context detection code", I take that to mean inc...
[15:21:35] PROBLEM - Host rhodium is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:22] (03PS3) 10Muehlenhoff: confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882
[15:23:01] RECOVERY - Host rhodium is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[15:23:20] (03CR) 10Muehlenhoff: [C: 03+2] confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882 (owner: 10Muehlenhoff)
[15:23:27] PROBLEM - Host db1114.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:33] (03CR) 10CDanis: [C: 03+1] Initial tox setup [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493297 (owner: 10Volans)
[15:23:35] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1005.eqiad.wmnet
[15:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:07] (03PS1) 10Urbanecm: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311)
[15:26:17] (03PS1) 10Jbond: Add rhodium.eqiad.wmnet back into service [puppet] - 10https://gerrit.wikimedia.org/r/493426 (https://phabricator.wikimedia.org/T216802)
[15:26:53] (03PS9) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196)
[15:27:01] (03CR) 10Jbond: [C: 03+2] Add rhodium.eqiad.wmnet back into service [puppet] - 10https://gerrit.wikimedia.org/r/493426 (https://phabricator.wikimedia.org/T216802) (owner: 10Jbond)
[15:27:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse)
[15:27:56] (03PS1) 10Urbanecm: Add throttle rule for Art+Feminism 2019 editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336)
[15:29:37] !log rebooting labstore2001
[15:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:55] PROBLEM - Host ms-fe1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:57] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:33:23] !log rebooting labstore2002
[15:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:49] (03PS8) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123)
[15:35:15] (03PS1) 10Herron: kafka-logging: replace logstash1005 with logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493429 (https://phabricator.wikimedia.org/T213898)
[15:36:44] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1005.eqiad.wmnet
[15:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:39] !log rebooting labsdb1006
[15:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:45] (03PS1) 10Herron: logstash: disable notifications on logstash1005 and logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898)
[15:38:15] RECOVERY - Host ms-fe1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.12 ms
[15:38:17] PROBLEM - Host ganeti1008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:38:46] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[15:38:59] hmm, is ganeti1008 expected?
[15:39:06] PROBLEM - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[15:39:09] its not me
[15:39:50] PROBLEM - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[15:40:15] yeah ganeti1008 was being moved, cc akosiaris
[15:40:27] _joe_: I'm around now if you are
[15:40:39] cool, was just looking at the syslogs. ok will leave alone
[15:41:18] <_joe_> thcipriani: uhm, yes, so should I just install scap on deploy1001/1002 for now?
[15:41:54] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 GTirloni T215892
[15:42:04] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute GTirloni T215892
[15:42:10] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute GTirloni T215892
[15:42:30] _joe_: I think that'll be good enough to test new functionality
[15:42:46] Thanks cmjohnson1! :)
[15:43:07] (03PS1) 10Giuseppe Lavagetto: Set wgWMEPhp7SamplingRate to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493431
[15:43:18] <_joe_> thcipriani: assuming my puppet changes have arrived everywhere, yes
[15:43:31] !log rebooting labsdb1007
[15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:44] <_joe_> thcipriani: uh now that I think of it, I might need to merge one further change
[15:43:57] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1006.eqiad.wmnet
[15:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:23] <_joe_> thcipriani: yeah sorry
[15:45:48] <_joe_> thcipriani: actually, we can test things right away
[15:46:02] _joe_: are all the new dsh files setup for mw_web_clusters?
[15:46:11] <_joe_> thcipriani: exactly that
[15:46:14] :)
[15:46:17] <_joe_> we'll override with -D
[15:46:25] k
[15:46:36] (03PS2) 10Herron: logstash: disable notifications on logstash1005 and logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898)
[15:46:47] <_joe_> !log install scap 3.9.1-1 on the deployment servers
[15:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:22] (03PS1) 10Muehlenhoff: package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/493433
[15:47:24] (03CR) 10Herron: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14926/" [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron)
[15:48:43] PROBLEM - Host ores1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:53] <_joe_> thcipriani: installed, please let's first test a normal null run
[15:48:58] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging]
[15:48:59] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed
[15:48:59] !log akosiaris@deploy1001 scap-helm citoid finished
[15:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:16] _joe_: doing
[15:49:17] <_joe_> jijiki: can you take a look at ores1002?
[15:49:30] it's being moved
[15:49:33] let it be
[15:49:38] <_joe_> oh ok
[15:49:46] <_joe_> #-dcops I guess
[15:50:21] PROBLEM - puppet last run on backup2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[15:50:47] !log thcipriani@deploy1001 Synchronized README: noop sync scap 3.9.1-1 (duration: 00m 52s)
[15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:49] PROBLEM - puppet last run on sessionstore2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[15:52:01] PROBLEM - Host ores1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:06] _joe_: no-php opcache sync-file looks good spot checking a few things
[15:52:23] <_joe_> cool
[15:52:40] (03Abandoned) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493421 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe)
[15:52:42] (03PS1) 10DCausse: Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435
[15:52:44] (03PS1) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[15:52:47] <_joe_> so, to test the whole thing
[15:52:54] !log rebooting labsdb1004
[15:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:01] (03PS1) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493437 (https://phabricator.wikimedia.org/T216993)
[15:53:50] <_joe_> thcipriani: scap .... -Dphp7_admin_port:9181 -Dmw_web_clusters:mediawiki-appserver-canaries,mediawiki-api-canaries ?
[15:53:53] (03PS2) 10DCausse: Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435
[15:53:56] (03PS2) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[15:54:19] _joe_: I'll go with those
[15:54:45] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:54:56] 10Operations, 10Analytics, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey) Proposal for removal: `registry brokers services etc consumers` @Ottomata what do you think?
[15:55:23] RECOVERY - puppet last run on backup2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[15:55:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey)
[15:55:49] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:55:54] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse)
[15:56:41] !log thcipriani@deploy1001 Synchronized README: noop sync to test opcache-manager in scap 3.9.1-1 (duration: 00m 53s)
[15:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:53] RECOVERY - puppet last run on sessionstore2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:58:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10Ottomata) don't know about registry, services or etc, but /brokers and /consumers should be leftover from when we might have had a...
[15:58:44] 10Operations, 10Discovery-Search: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (10Mathew.onipe)
[16:01:25] RECOVERY - Host db1114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.62 ms
[16:01:26] jouncebot: now
[16:01:26] No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[16:01:43] CFisch_remote: you meant now right? :)
[16:02:37] yay
[16:02:54] is there a patch? :)
[16:03:11] PROBLEM - Host db1104.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:15] (03PS12) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[16:03:42] addshore: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493409/
[16:03:46] (03PS9) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123)
[16:05:25] (03PS3) 10Addshore: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:05:50] ^^ if noone has any objections I'm going to push that one out in the next few mins (a followup from swat earlier)
[16:06:26] (03PS1) 10Herron: rsyslog: replace logstash1004 with logstash1010 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493440 (https://phabricator.wikimedia.org/T213898)
[16:06:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey) /etc is my fault when I've set up burrow the first time, and registry/services seems to be @joal's slider test (so safe to...
[16:08:18] !log rebooting labstore2003
[16:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:33] (03CR) 10Addshore: [C: 03+2] Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:08:46] (03PS2) 10Herron: rsyslog: replace logstash1004 with logstash1010 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493440 (https://phabricator.wikimedia.org/T213898)
[16:09:59] (03Merged) 10jenkins-bot: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:10:04] (03CR) 10Herron: [C: 03+2] rsyslog: replace logstash1004 with logstash1010 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493440 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron)
[16:11:10] CFisch_remote: it is on mwdebug1002
[16:11:56] checking
[16:13:33] works like a charm \o/
[16:13:37] RECOVERY - Host db1104.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms
[16:13:39] can be deployed addshore
[16:13:41] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:13:42] CFisch_remote: amazing
[16:13:49] (03PS13) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[16:14:46] syncing
[16:15:35] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T214905 Add ReferencePreviews to allowed BetaFeatures (duration: 00m 54s)
[16:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:39] T214905: Show referencePreviews on group0 wikis - https://phabricator.wikimedia.org/T214905
[16:15:41] CFisch_remote: all done!
[16:18:20] (03PS14) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[16:19:49] (03CR) 10jenkins-bot: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:20:45] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) >>! In T187987#4990906, @fgiunchedi wrote: > Thanks @Gilles ! > > I have tried again a 9000h migration and that worked, I've now merged...
[16:22:55] (03PS2) 10Herron: kafka-logging: replace logstash1005 with logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493429 (https://phabricator.wikimedia.org/T213898)
[16:23:11] (03CR) 10Mathew.onipe: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1002/14930/cloudelastic1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:23:14] (03CR) 10CRusnov: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov)
[16:27:56] !log migrating kafka on logstash1005 to logstash1011 T213898
[16:27:56] !log migrating kafka on logstash1005 to logstash1011 T213898
[16:27:57] RECOVERY - Host ores1002 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[16:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:00] T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898
[16:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:50] double the logs double the fun… irc client disconnected/reconnected grr
[16:29:05] RECOVERY - Host ganeti1008 is UP: PING OK - Packet loss = 0%, RTA = 36.91 ms
[16:29:34] (03CR) 10EBernhardson: cloudelastic: Add cloudelastic configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:29:39] RECOVERY - Host ores1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.35 ms
[16:29:50] (03PS4) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229)
[16:30:05] (03CR) 10Herron: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14928/" [puppet] - 10https://gerrit.wikimedia.org/r/493429 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron)
[16:30:34] (03PS1) 10Ottomata: eventgate-analytics Set kafka compression.codec: snappy and message.max.bytes: 4194304 [deployment-charts] - 10https://gerrit.wikimedia.org/r/493443 (https://phabricator.wikimedia.org/T206785)
[16:32:17] PROBLEM - Check whether ferm is active by checking the default input chain on ganeti1008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[16:32:39] PROBLEM - Check systemd state on ganeti1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:34:24] (03CR) 10Mathew.onipe: cloudelastic: Add cloudelastic configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:36:33] PROBLEM - puppet last run on ganeti1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:37:19] (03PS1) 10Ottomata: eventgate: set compression.codec: snappy and message.max.bytes: 4194304 [deployment-charts] - 10https://gerrit.wikimedia.org/r/493444 (https://phabricator.wikimedia.org/T206785)
[16:39:00] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey) Got down to: ` [zk: localhost:2181(CONNECTED) 37] ls / [zookeeper, yarn-leader-election, hadoop-ha, hive_zookeeper_namesp...
[16:39:06] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey)
[16:39:57] !log clean up old/stale zookeeper znodes from conf100[4-6] - T216979
[16:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:01] T216979: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979
[16:43:17] RECOVERY - Check whether ferm is active by checking the default input chain on ganeti1008 is OK: OK ferm input default policy is set
[16:43:21] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1006.eqiad.wmnet
[16:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:46] RECOVERY - puppet last run on ganeti1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:48:55] (03CR) 10DCausse: [C: 04-1] Upgrade logstash plugins to 5.6.14 (031 comment) [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493437 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe)
[16:50:13] (03CR) 10CDanis: [C: 03+1] "A couple thoughts but don't feel strongly; this looks pretty good." (032 comments) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (owner: 10Volans)
[16:54:42] PROBLEM - Host ms-be1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:55:51] (03PS1) 10BryanDavis: toolforge: switch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712)
[16:56:13] the ms-be1030 alert must be T212348 ?
[16:56:13] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348
[16:56:47] (03CR) 10BryanDavis: "Needs testing in tools-beta or similar before deploying widely" [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis)
[17:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1700).
[17:00:04] No GERRIT patches in the queue for this window AFAICS.
[17:00:28] \o/
[17:00:46] cdanis: indeed, downtimed the host but not the mgmt
[17:05:20] RECOVERY - Host ms-be1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[17:06:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] eventgate: set compression.codec: snappy and message.max.bytes: 4194304 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/493444 (https://phabricator.wikimedia.org/T206785) (owner: 10Ottomata)
[17:06:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov)
[17:08:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/493433 (owner: 10Muehlenhoff)
[17:08:13] (03PS2) 10Alexandros Kosiaris: package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/493433 (owner: 10Muehlenhoff)
[17:09:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/493433 (owner: 10Muehlenhoff)
[17:09:38] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Move cloudvirt1012 to a 10G rack and connect 10g nics - https://phabricator.wikimedia.org/T217346 (10Andrew)
[17:10:31] (03CR) 10Nikerabbit: [C: 04-1] Enable edittag for ExternalGuidance in CX and VE (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry)
[17:10:49] (03CR) 10Volans: [C: 03+2] Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435 (owner: 10DCausse)
[17:11:20] _joe_: https://phabricator.wikimedia.org/T217335 FYI, something missbehaving on beta when the php7 BF is on
[17:11:29] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Move cloudvirt1018 to a 10G rack, connect 10G nics - https://phabricator.wikimedia.org/T217347 (10Andrew)
[17:11:41] (03PS5) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229)
[17:11:57] <_joe_> addshore: good to know, I guess more details are neeeded
[17:12:09] <_joe_> and probably you want to involve people in core platform as well
[17:12:29] <_joe_> if you're sure it has to do with php7, please add "php7.2-support" as a tag
[17:12:29] (03Merged) 10jenkins-bot: Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435 (owner: 10DCausse)
[17:12:34] (03CR) 10CRusnov: [C: 03+2] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov)
[17:13:10] yup, check checked on prod (group0) and the issue doesnt seem to show its face there, which is good. I'll go and tag the task now
[17:13:31] (03PS2) 10Jbond: Load apt when testing base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/490837
[17:14:15] <_joe_> addshore: you can verify with XWD
[17:14:29] (03CR) 10Jbond: [C: 03+2] Load apt when testing base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/490837 (owner: 10Jbond)
[17:14:47] (03CR) 10EBernhardson: [C: 03+1] cloudelastic: Add cloudelastic configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[17:14:50] _joe_: that ticket i linked you was totally the wrong one
[17:15:16] <_joe_> uh?
[17:15:34] <_joe_> which one is the correct one then?
[17:15:34] (03PS3) 10Jbond: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff)
[17:16:04] PROBLEM - Host ms-be1028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:16:17] (03CR) 10jerkins-bot: [V: 04-1] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff)
[17:16:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 (10Andrew) the next step is moving this to a new rack for 10G connections (either 2, 4 or 7 in row B) so I'm tagging dc-ops. You can ha...
[17:16:54] _joe_: https://phabricator.wikimedia.org/T217323
[17:17:04] I even started adding comments to the wrong ticket, think my work day is over
[17:17:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1012 to a 10G rack and connect 10g nics - https://phabricator.wikimedia.org/T217346 (10Andrew) Steps: [] Move host to a rack with 10G -- B2, B4 or B7 I believe [] Enable the 10G nic in the bios [] Move/install cables [] Upd...
[17:17:33] <_joe_> addshore: ahah ok, but tbh it's easier to solve
[17:17:40] yup
[17:19:12] (03CR) 10Volans: [V: 03+2 C: 03+2] Initial tox setup [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493297 (owner: 10Volans)
[17:19:20] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10Andrew) Steps: [] Move host to a rack with 10G -- B2, B4 or B7 I believe [] Enable the 10G nic in the bios [] Move/install cables [] Update switch config [] Re-image
[17:19:26] PROBLEM - puppet last run on ganeti1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:19:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1018 to a 10G rack, connect 10G nics - https://phabricator.wikimedia.org/T217347 (10Andrew) Steps: [] Move host to a rack with 10G -- B2, B4 or B7 I believe [] Enable the 10G nic in the bios [] Move/install cables [] Update...
[17:19:48] (03PS4) 10Jbond: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff)
[17:19:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1024 to 10Gb ethernet - https://phabricator.wikimedia.org/T216724 (10Andrew)
[17:20:50] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:20:54] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:21:06] yep that's me
[17:21:08] i broken it
[17:21:22] RECOVERY - Host ms-be1028.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 37.26 ms
[17:22:32] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:22:40] (03PS10) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123)
[17:23:38] !log recreating replicas, master ops events for db1078, db1075 T213858
[17:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:41] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858
[17:24:07] !log powering down sodium to move racks T212348
[17:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:10] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348
[17:25:32] (03PS1) 10CRusnov: ganeti: Fix path to etc/default file. [puppet] - 10https://gerrit.wikimedia.org/r/493456
[17:26:55] <_joe_> addshore: can you confirm if you still see the problem or not?
[17:27:00] PROBLEM - Host sodium is DOWN: PING CRITICAL - Packet loss = 100%
[17:27:16] (03CR) 10CRusnov: [C: 03+2] ganeti: Fix path to etc/default file. [puppet] - 10https://gerrit.wikimedia.org/r/493456 (owner: 10CRusnov)
[17:27:41] _joe_: solved for me
[17:27:42] ty
[17:28:24] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:28:27] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gtirloni on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1015.eqiad.wmnet ` The log can be found in `/var/log/wmf-a...
[17:28:48] <_joe_> I just restarted php7.2-fpm
[17:29:00] PROBLEM - Host sodium.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:29:34] RECOVERY - puppet last run on ganeti1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:31:35] (03PS3) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[17:32:41] ah lovely I was wondering why my pxe install was not working
[17:32:45] sodium is down?
[17:32:46] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:33:03] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse)
[17:33:10] ah it is moving
[17:33:12] bad timing
[17:33:20] * elukey cries in a corner
[17:33:32] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:34:30] elukey: Go torrent a linux iso
[17:34:40] * Reedy looks around
[17:35:09] * elukey hands labsdb1012 install over to Reedy
[17:36:14] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:36:16] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:36:20] RECOVERY - Host sodium is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms
[17:36:44] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[17:36:52] PROBLEM - Host ms-be1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:38:36] (03PS4) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[17:39:37] (03PS1) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493460 (https://phabricator.wikimedia.org/T216993)
[17:39:42] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libguestfs-tools]
[17:39:48] RECOVERY - Host sodium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms
[17:39:58] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse)
[17:40:56] (03Abandoned) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493437 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe)
[17:42:12] RECOVERY - Host ms-be1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[17:42:32] (03PS5) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[17:44:23] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T217301 (10Papaul) a:05Papaul→03Marostegui Disk replacement complete.
[17:45:01] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) 05Open→03Resolved the motherboard has been replaced, the idrac and bios have been updated to latest version. resolving task, reopen if there are any problems.
[17:47:35] (03PS1) 10Jbond: Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 [17:47:43] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) FYI I solved the issue disabling the SD card support in: System Configuration -> Bios/Platform configuration -> System Options -> Usb... [17:49:17] (03CR) 10DCausse: [C: 03+1] Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493460 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe) [17:49:18] RECOVERY - nova-compute proc maximum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [17:51:46] !log logstash1011 kafka now in sync. transitioning logstash1005 to spare system T213898 [17:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:49] T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 [17:52:34] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10herron) [17:52:46] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10herron) [17:53:19] (03CR) 10DCausse: [C: 04-1] Add cookbook for elastic6 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1800). [18:00:46] ACKNOWLEDGEMENT - MegaRAID on sodium is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T217356 [18:00:51] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T217356 (10ops-monitoring-bot) [18:01:58] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:04:10] (03CR) 10Herron: [C: 03+2] "ready to re-enable alerts now, reverting" [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:04:20] (03PS1) 10Herron: Revert "logstash: disable notifications on logstash1005 and logstash1011" [puppet] - 10https://gerrit.wikimedia.org/r/493470 [18:04:28] (03PS2) 10Herron: Revert "logstash: disable notifications on logstash1005 and logstash1011" [puppet] - 10https://gerrit.wikimedia.org/r/493470 [18:05:30] (03CR) 10Herron: [C: 03+2] Revert "logstash: disable notifications on logstash1005 and logstash1011" [puppet] - 10https://gerrit.wikimedia.org/r/493470 (owner: 10Herron) [18:07:19] 10Operations, 10Analytics, 10Services: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. 
- https://phabricator.wikimedia.org/T217359 (10Ottomata) [18:07:45] (03PS6) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 [18:10:34] (03PS1) 10Herron: kafka-logging: replace logstash1006 with logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) [18:12:11] PROBLEM - Kafka Broker Replica Max Lag on logstash1011 is CRITICAL: 4.978e+06 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [18:13:14] herron: ---^ [18:13:21] probably due to the migration [18:13:59] hmm yeah, no data points? [18:14:16] oh nvm [18:14:46] it’s in sync now, this should clear [18:17:09] PROBLEM - Kafka Broker Replica Max Lag on logstash1011 is CRITICAL: 8.235e+05 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [18:19:04] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1015.eqiad.wmnet'] ` and were **ALL** successful. [18:19:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10Nuria) p:05High→03Triage [18:19:18] these are false positives from while this host was syncing up. not sure why icinga is throwing critical. 
looking [18:21:12] "Alert if large replica lag for more than 50% of the time in the last 30 minutes." [18:21:26] !log cp1076 power down for network port move [18:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:41] death to false positives [18:21:58] that’s a lot easier to understand than 8.235e+05 ge 5e+05 [18:22:29] RECOVERY - Kafka Broker Replica Max Lag on logstash1011 is OK: (C)5e+05 ge (W)1e+05 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [18:23:29] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:55] (03PS1) 10Herron: logstash: disable notifications on logstash1006 and logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493476 (https://phabricator.wikimedia.org/T213898) [18:27:52] (03CR) 10Herron: [C: 03+2] logstash: disable notifications on logstash1006 and logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493476 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:28:05] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [18:28:20] !log cp1077 power off for network port relocation [18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:56] !log stop pybal on lvs1016 - T212348 [18:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:00] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 [18:29:15] (03PS10) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) [18:30:27] PROBLEM - Host cp1077 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner:
10DCausse) [18:31:31] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=46) [18:31:35] (03CR) 10Ottomata: "Neither of these will matter for devs. snappy is fine and should be used anyway. The message size limit should be set the same for devel" [deployment-charts] - 10https://gerrit.wikimedia.org/r/493444 (https://phabricator.wikimedia.org/T206785) (owner: 10Ottomata) [18:31:39] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:31:51] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:33:29] (03CR) 10CRusnov: "according to the documentation, this should allow core dumps to form (for example if the value of core_limit is to to 'unlimited') from se" [puppet] - 10https://gerrit.wikimedia.org/r/493294 (owner: 10CRusnov) [18:33:43] (03PS1) 10Herron: rsyslog: replace logstash1005 with logstash1011 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493477 (https://phabricator.wikimedia.org/T213898) [18:34:24] !log cp1078 power down for network move [18:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:35] RECOVERY - Host cp1077 is UP: PING OK - Packet loss = 0%, RTA = 37.62 ms [18:36:01] PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:19] (03Abandoned) 10Ottomata: eventgate-analytics Set kafka compression.codec: snappy and message.max.bytes: 4194304 [deployment-charts] - 10https://gerrit.wikimedia.org/r/493443 (https://phabricator.wikimedia.org/T206785) (owner: 10Ottomata) [18:39:18] (03CR) 10Herron: [C: 03+2] rsyslog: replace logstash1005 with logstash1011 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493477 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:41:39] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 
not-conn: cp1078_v4, cp1078_v6 [18:41:41] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:41:45] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:41:49] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:51] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:51] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:55] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:57] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:57] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:05] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:17] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:17] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:17] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:21] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:23] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: 
cp1078_v4, cp1078_v6 [18:42:23] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:27] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:27] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:29] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:31] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:31] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:31] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:33] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:33] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:33] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:39] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:39] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:39] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:45] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:46] loud [18:42:47] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:47] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:49] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:43:27] RECOVERY - Host cp1078 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms 
[18:43:27] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [18:43:27] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [18:43:27] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [18:43:29] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK [18:43:31] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK [18:43:31] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK [18:43:31] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK [18:43:33] !log start pybal on lvs1016 - T212348 [18:43:35] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [18:43:37] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK [18:43:37] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK [18:43:41] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [18:43:41] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [18:43:43] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK [18:43:43] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [18:43:45] (03PS9) 10Herron: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 [18:43:47] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK [18:43:53] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK [18:43:53] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [18:43:53] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [18:43:55] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [18:44:01] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK [18:44:01] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK 
[18:44:01] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK [18:44:03] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK [18:44:04] (03CR) 10jerkins-bot: [V: 04-1] rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 (owner: 10Herron) [18:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:07] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 [18:44:07] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:44:26] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK [18:44:30] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK [18:46:03] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14931/" [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:46:05] 10Operations, 10ops-codfw: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [18:46:28] 10Operations, 10ops-codfw: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) p:05Triage→03Normal [18:46:56] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 46 connections established with conf1004.eqiad.wmnet:4001 (min=46) [18:47:54] (03CR) 10Bstorm: labstore: convert our first systemd timer to the new format (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:47:56] (03PS1) 10Esanders: VE: Enable true section editing for mobile on labswiki & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) [18:49:37] (03PS2) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 
(https://phabricator.wikimedia.org/T210818) [18:50:13] (03CR) 10jerkins-bot: [V: 04-1] labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:50:39] (03PS1) 10Giuseppe Lavagetto: scap::dsh: add specific groups for mediawiki web clusters [puppet] - 10https://gerrit.wikimedia.org/r/493484 [18:50:41] (03PS1) 10Giuseppe Lavagetto: scap: fix php version, add php7 admin port [puppet] - 10https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964) [18:50:43] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: stop revalidating opcache [puppet] - 10https://gerrit.wikimedia.org/r/493486 (https://phabricator.wikimedia.org/T211964) [18:52:06] !log migrating logstash1006 kafka to logstash1012 T213898 [18:52:09] (03PS2) 10Giuseppe Lavagetto: scap::dsh: add specific groups for mediawiki web clusters [puppet] - 10https://gerrit.wikimedia.org/r/493484 [18:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:10] T213898: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 [18:52:22] !log installing libgd security updates on trusty [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:56] RECOVERY - Device not healthy -SMART- on db2033 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2033&var-datasource=codfw+prometheus/ops [18:53:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap::dsh: add specific groups for mediawiki web clusters [puppet] - 10https://gerrit.wikimedia.org/r/493484 (owner: 10Giuseppe Lavagetto) [18:53:53] (03CR) 10Herron: [C: 03+2] "As was done with logstash100[45], I've manually stopped kafka on logstash1006 and will leave puppet disabled there until after logstash101" [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:54:10] (03PS2) 10Herron: kafka-logging: replace logstash1006 with logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) [18:54:29] (03PS3) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [18:54:38] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK [18:55:07] (03CR) 10jerkins-bot: [V: 04-1] labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:59:38] (03PS2) 10Smalyshev: Enable WikibaseCirrusSearch on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) [18:59:50] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1900). [19:00:04] kostajh, Smalyshev, Zoranzoki21, Urbanecm, ottomata, and Pchelolo: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:10] here [19:00:11] here [19:00:14] here [19:01:08] (03PS4) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [19:01:32] here [19:01:38] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK [19:03:07] (03CR) 10Thcipriani: [C: 03+1] "nice! test went well today, lgtm." [puppet] - 10https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [19:03:16] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [19:03:44] (03PS1) 10Muehlenhoff: Remove obsolete Upstart job [puppet] - 10https://gerrit.wikimedia.org/r/493489 [19:05:02] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [19:05:21] I can SWAT [19:06:48] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK [19:06:48] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK [19:06:48] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK [19:07:02] (03PS2) 10Thcipriani: GrowthExperiments: Start help panel experiment on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:07:17] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:07:20] Thanks thcipriani. 
[19:07:52] sure thing :) [19:08:26] (03Merged) 10jenkins-bot: GrowthExperiments: Start help panel experiment on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:08:40] (03CR) 10jenkins-bot: GrowthExperiments: Start help panel experiment on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:09:04] (03PS5) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [19:09:17] kostajh: your change is live on mwdebug1002 if there's anything to check there. [19:09:38] thcipriani: just a moment [19:12:00] (03CR) 10Muehlenhoff: "With regard to updating the docs for the rollout, there's existing documentation for debdeploy which can be repurposed (or probably bets t" [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [19:12:09] thcipriani: you can deploy it [19:12:12] thanks! [19:12:18] going live [19:14:16] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:07] (03CR) 10Muehlenhoff: [C: 03+1] "The patch looks fine, but I haven't tested this further yet, I can ping this review set once the tests for the app servers are completed." [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [19:15:26] hrm, one of these hosts seems stuck, maybe mw1272? 
[19:15:46] yes, 1272 [19:16:09] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:493287|GrowthExperiments: Start help panel experiment on viwiki]] T215666 (duration: 03m 02s) [19:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:13] T215666: Help panel: deploy on Vietnamese Wikipedia - https://phabricator.wikimedia.org/T215666 [19:16:24] ^ kostajh live now [19:16:34] thcipriani: cheers! [19:16:37] PROBLEM - Host mw1272.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:16:55] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.080 second response time [19:17:04] (03PS2) 10Esanders: VE: Enable true section editing for mobile on labswiki & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) [19:17:08] anyone around who could look into mw1272? [19:17:55] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [19:18:16] robh: could you take a look at mw1272 or wrangle the appropriate person (I see you're on clinic duty)? [19:18:17] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:18:42] sure [19:18:50] thanks :) [19:19:15] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [19:19:17] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [19:19:26] (03CR) 10Aaron Schulz: "If the "worst" reply of all routes used in the SET broadcast is NOT_STORED (e.g. 
due to NullRoute) and that is given back to either of the" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [19:19:48] SMalyshev: after your change merges it'll go out with the next run of https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [19:20:02] thcipriani: ok cool [19:20:08] (03PS10) 10Herron: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 [19:20:15] should only change anything for beta [19:20:37] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:20:55] (03Merged) 10jenkins-bot: Enable WikibaseCirrusSearch on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [19:21:11] (03CR) 10jenkins-bot: Enable WikibaseCirrusSearch on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [19:21:35] !log mw1272 unresponsive to mgmt or production interfaces [19:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:39] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.078 second response time [19:22:11] !log mw1272 being worked on by onsite [19:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:21] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336) (owner: 10Urbanecm) [19:23:37] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms [19:23:44] (03CR) 10Thcipriani: "invalid date in here somewhere. I'll look at the end of SWAT if I have time." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:24:39] (03Merged) 10jenkins-bot: Add throttle rule for Art+Feminism 2019 editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336) (owner: 10Urbanecm) [19:25:21] PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:25:37] 10Operations, 10ops-eqiad, 10serviceops, 10HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (10Cmjohnson) Received the parts, replaced CPU2 and DIMM B1 and cleared the log Return shipping info USPS 9202 3946 2441 1124 14 FEDEX 9611918 2393026 77862432 [19:25:37] RECOVERY - Host mw1272.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [19:26:20] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:493321|Enable WikibaseCirrusSearch on Beta Cluster]] (beta only change/noop sync) T215684 (duration: 00m 55s) [19:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:24] T215684: Deploy & test WikibaseCirrusSearch on beta cluster - https://phabricator.wikimedia.org/T215684 [19:26:25] RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.080 second response time [19:27:39] Urbanecm: all merged, syncing your change now. [19:27:45] ok [19:28:21] Urbanecm: if you have a few minutes could you take a look at https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493375/ while I sync out the last couple changes? If you don't have time that's OK, too. 
[19:28:48] thcipriani, sure [19:28:54] thank you [19:29:16] yw [19:30:09] PROBLEM - HHVM jobrunner on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:30:29] (03PS2) 10Thcipriani: Remove legacy eventBus config settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:30:38] !log thcipriani@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493427|Add throttle rule for Art+Feminism 2019 editathon]] T217336 (duration: 00m 54s) [19:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:53] T217336: Lift IP cap on en.wiki for account creation for Jewish Museum NYC - Sunday March 3 - https://phabricator.wikimedia.org/T217336 [19:31:14] (03CR) 10jenkins-bot: Add throttle rule for Art+Feminism 2019 editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336) (owner: 10Urbanecm) [19:31:17] RECOVERY - HHVM jobrunner on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [19:31:23] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:31:45] PROBLEM - HHVM jobrunner on mw2280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.027 second response time [19:32:16] danke [19:32:29] (03Merged) 10jenkins-bot: Remove legacy eventBus config settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:32:55] RECOVERY - HHVM jobrunner on mw2280 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.010 second response time [19:33:41] ottomata: oh, does wmf.19 need to be everywhere for this change? 
[19:34:07] we're not quite there just yet https://tools.wmflabs.org/versions/ [19:34:14] not for that one [19:34:19] ah, cool [19:34:23] there were 2 [19:34:29] i removed the one that was dependent on .19 [19:34:49] the code that went out last week should be using the new configs already [19:34:55] ottomata: ok, change is on mwdebug1002 if there's anything you want to check before it goes live [19:35:01] ok i'll double check just in case... [19:35:59] (03PS3) 10Urbanecm: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:36:11] (03CR) 10jerkins-bot: [V: 04-1] Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:36:13] (03CR) 10Bstorm: "This seems to do only what it is supposed to! Madness https://puppet-compiler.wmflabs.org/compiler1002/14933/labstore2004.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [19:36:43] (03PS4) 10Urbanecm: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:36:47] <_joe_> !log upgrading scap on all servers [19:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:23] thcipriani, should be fixed in PS4 [19:37:54] thcipriani: all looks good [19:37:56] proceed! [19:37:58] Urbanecm: thank you! 
I appreciate it :) [19:38:04] ottomata: /me does [19:38:05] Yw [19:39:14] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Cmjohnson) The new CPU came in and I replaced CPU1 Return Shipping USPS 9202 3946 5301 2441 1128 27 FEDEX 9611918 2393026 77862845 [19:40:23] !log thcipriani@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT [[gerrit:492770|Remove legacy eventBus config settings.]] (duration: 00m 53s) [19:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:39] ^ ottomata live now [19:40:49] danke [19:41:03] PROBLEM - mysqld processes on labsdb1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:41:31] page [19:41:37] <_joe_> yep [19:41:38] downtime expired I think [19:41:40] bstorm_: ^ [19:41:43] <_joe_> ah ok [19:41:54] Yeah...but I thought I brought it back up :) [19:41:55] Checking [19:42:43] (03CR) 10jenkins-bot: Remove legacy eventBus config settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:43:01] bstorm_: someone stopped it again apparently [19:43:20] 190228 15:53:09 [Note] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown [19:43:23] It's no longer the replica of toolsdb [19:43:27] But still.... [19:43:32] MySQL is down [19:43:33] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:43:41] JBond42: ? [19:44:09] Sorry [19:44:13] jbond42 [19:44:20] bstorm_: I am going to downtime it again for now to avoid another page [19:44:26] I know he was looking to do reboots [19:44:38] Ok, thanks. I'd expect he'd downtime it though...
[19:44:39] (03Merged) 10jenkins-bot: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:44:55] bstorm_: yeah, that matches the entry on the last, probably jbond42 [19:45:00] I guess he didn't bring mysql up? [19:45:15] Might not have known that he has to, actually :) [19:45:24] anything happening that should cause me to stop SWAT? [19:45:27] bstorm_: should I bring it up? [19:45:29] thcipriani: nope [19:45:29] Sure [19:45:31] k [19:45:43] bstorm_: ok, doing it [19:45:48] it is now up [19:46:02] cool thanks [19:46:03] RECOVERY - mysqld processes on labsdb1004 is OK: PROCS OK: 1 process with command name mysqld [19:46:07] thanks. Don't bother starting slave. clouddb1002 is already up to date with the master :) [19:46:23] bstorm_: if this host isn't important anymore, maybe we should disable notifications [19:46:25] I just have to get postgres migrated, and then we can turn off notifications for that old thing [19:46:32] ah ok :) [19:46:59] bstorm_: I am going back to the sofa then :) [19:47:05] Text me if you need me! [19:47:06] Enjoy! [19:47:08] Thanks [19:47:10] o/ [19:47:13] thcipriani, would you have time for https://gerrit.wikimedia.org/r/493425 too please? [19:49:13] Urbanecm: sure. you'll need to manually rebase though, gerrit can't seem to do it. [19:49:19] willdo [19:50:17] (03PS2) 10Urbanecm: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) [19:50:21] should be done [19:50:58] btw, anybody who knows how the dbname field works? I don't think "all" is a keyword [19:51:27] talking about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/493375/4/wmf-config/throttle.php [19:53:23] Urbanecm: yeah, we should remove that since it defaults to all. I assumed it was a group name. 
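The `dbname` handling being puzzled over here can be mirrored in a short sketch. This is a hedged Python rendering of the PHP check in throttle-analyze.php discussed just below, `isset( $options['dbname'] ) && !in_array( $wgDBname, (array)$options['dbname'] )` (our own function name, not MediaWiki code), showing why a value like `'all'` is treated as a literal wiki dbname and therefore matches nothing:

```python
def rule_applies(options, wg_dbname):
    """Hypothetical mirror of the PHP skip-condition: a throttle rule
    applies unless 'dbname' is set and the current wiki is not listed."""
    if 'dbname' not in options:
        # No dbname restriction: the rule applies on every wiki.
        return True
    dbnames = options['dbname']
    if not isinstance(dbnames, (list, tuple)):
        # Equivalent of PHP's (array) cast on a scalar value.
        dbnames = [dbnames]
    return wg_dbname in dbnames

# 'all' is not a keyword, just a dbname that no wiki has, so a rule
# written with dbname => 'all' silently applies nowhere:
rule_applies({'dbname': 'all'}, 'enwiki')   # False
rule_applies({}, 'enwiki')                  # True (defaults to all wikis)
```

This is why removing the bogus `'all'` value (rather than keeping it) restores the intended "applies everywhere" default.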
[19:53:37] I just looked into throttle-analyze.php [19:53:51] (03CR) 10jenkins-bot: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:53:51] jouncebot: next [19:53:51] In 0 hour(s) and 6 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T2000) [19:53:57] Relevant condition seems to be `isset( $options['dbname'] ) && !in_array( $wgDBname, (array)$options['dbname'] )` [19:54:01] take your time for the swat completion :) [19:54:27] that doesn't seem to understand groups at all [19:54:32] * thcipriani fixes [19:54:37] thanks thcipriani [19:55:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This needs to be released concurrently to a manual update of the php extension packages on deploy1001 and deploy2001" [puppet] - 10https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [19:55:36] (03PS1) 10Thcipriani: Thottle Rules: remove 'all' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 [19:55:56] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 (owner: 10Thcipriani) [19:56:38] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:57:01] (03Merged) 10jenkins-bot: Thottle Rules: remove 'all' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 (owner: 10Thcipriani) [19:57:26] Urbanecm: nice catch [19:57:31] thanks [19:58:10] (03PS3) 10Thcipriani: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:58:16] (03CR) 10Thcipriani: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 
(https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:58:20] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:58:23] try that again [19:59:28] (03Merged) 10jenkins-bot: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T2000) [20:00:34] ehm, should I try something thcipriani ? [20:00:41] * Urbanecm is confused [20:00:52] Urbanecm: nope, just lining things up, syncing shortly [20:02:04] !log thcipriani@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493425|Add throttle exception for Amnesty International Editathon]] [[gerrit:493496|Thottle Rules: remove "all"]] [[gerrit:493375|Add new throttle rules]] T216998 T217063 T217305 T217311 (duration: 00m 54s) [20:02:09] ^ Urbanecm all live now [20:02:17] great, thanks thcipriani [20:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:23] T216998: Throttle Exception for Amnesty International edit-a-thon on March 1st - https://phabricator.wikimedia.org/T216998 [20:02:24] T217311: Throttle Exception for Amnesty International edit-a-thon on March 8th - https://phabricator.wikimedia.org/T217311 [20:02:26] T217305: Lift IP cap on en.wiki for account creation for MoMA NYC - Saturday March 2, 2019 - https://phabricator.wikimedia.org/T217305 [20:02:27] T217063: Throttle Exception for WikiConNL Utrecht on 8 & 9 March 2019 - https://phabricator.wikimedia.org/T217063 [20:02:30] Urbanecm: thanks for your help with all those :) [20:02:41] always happy to help :) [20:02:52] hashar: you're clear for train [20:03:33] (I think European train was used this week) [20:04:48] I think the plan is to 
roll forward to all wikis now, since we're still not on group2 [20:05:12] cf: https://tools.wmflabs.org/versions/ [20:05:21] ah, okay then. thanks for explaining [20:05:41] (03CR) 10jenkins-bot: Thottle Rules: remove 'all' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 (owner: 10Thcipriani) [20:05:43] (03CR) 10jenkins-bot: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [20:08:12] were all the issues fixed already? [20:08:35] well, I guess logstash will tell us in due time :) [20:09:28] thcipriani: thx :) [20:09:32] proceeding with the train [20:09:43] yeah promoting all :) [20:09:52] (03PS1) 10Hashar: all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 [20:09:54] (03CR) 10Hashar: [C: 03+2] all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 (owner: 10Hashar) [20:11:40] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 (owner: 10Hashar) [20:13:08] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.19 [20:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:01] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9914 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:15:09] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.8947 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:15:17] that sounds bad [20:15:19] looking at that [20:17:01] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 (owner: 10Hashar) [20:18:47] herron: last event received seems to have been around 19:48UTC if it helps [20:19:06] which also means I should probably not have proceeded with the promote since we are blind?
:( [20:20:14] thanks, yeah seeing that as well [20:20:21] trying logstash bounces right now [20:21:37] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2378 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:22:21] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:23:35] herron: and I would guess we lost the events haven't we? [20:23:47] depends on the transport [20:23:58] the kafka ones should still make their way into logstash [20:26:32] herron: do you have any idea what's going on yet? [20:27:09] not sure yet [20:27:45] it lines up with the time about when kafka was stopped on logstash1006, but that doesn’t explain why other log sources wouldn’t be flowing [20:28:11] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:28:27] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9946 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:30:04] herron: at least it seems I receive events from mediawiki now [20:33:30] for the last 15 min time period? [20:39:13] herron: hmm no cancel that sorry :( [20:39:19] I was looking at another source [20:39:25] ok, yeah seeing the same then [20:39:30] hey hashar train all good .19 out everywhere? [20:39:49] namely looking at mwlog1001.eqiad.wmnet:/srv/mw-log/exception.log [20:40:02] ottomata: I dont know :/ [20:40:56] oh no? [20:42:08] Class 'PHPlot' not found [20:42:09] really [20:42:21] one wonders how it is even possible to get those [20:45:19] herron: do we still use logstash's 'persistent queue' ? [20:46:39] cdanis: yes [20:46:56] there was a similar problem go.dog diagnosed over Christmas [20:47:00] all logstashen stuck around the same time [20:47:16] oh interesting. did disabling that help?
[20:47:26] I believe he just cleared it and restarted them [20:47:31] ok let’s try it [20:47:43] 2018-12-26 16:36:16 godog ok I tried clearing logstash's persistent queue on 1007 and that seems to have done the trick [20:47:45] 2018-12-26 16:37:00 godog !log clear logstash persistent queue /var/lib/logstash on logstash100[789] [20:48:52] thanks, fingers crossed [20:49:30] hah, wow [20:49:35] yeah logs are flowing again [20:49:36] similar symptoms as before: jvm threads dropped, cpu usage down, lots of messages in the receive buffer but not being processed, logstash process blocked on a futex [20:49:39] that’s terrible [20:49:46] yeah, it is [20:49:50] I have no idea how go.dog guessed this at the time [20:50:15] "making a note to followup on this tomorrow, we might not even want/need the persistent queue now anyways" [20:50:22] we should figure that out ;) [20:51:33] yeah I’m super inclined to disable this [20:52:46] !log cleared logstash persistent queue on logstash100[7-9] [20:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:17] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:53:18] thanks cdanis, glad you have a good memory [20:53:39] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:54:25] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.00127 https://grafana.wikimedia.org/dashboard/db/logstash [20:54:31] \o/ [20:54:42] mostly we have to thank go.dog and grep [20:54:58] strange coincidence that it’s around the same time as moving logstash1006 kafka to logstash1012 [20:55:08] I wonder if something about that caused logstash to go into this state [20:56:17] it is very interesting that it happened on all the hosts around the same time, then and now [20:56:38] i don't believe there was maintenance or anything 
happening on 12/26 when it happened before; most people were off [20:57:10] hmm yeah… and none of the kafka work today was reverted or changed to fix this [20:57:20] apparently some old event got flushed to logstash [20:57:30] but only partially maybe 30% of them [20:57:34] with some drop around 20:00 [20:57:43] only change was to mv /var/lib/elasticsearch/queue /var/lib/elasticsearch/queue.bad across the log collectors [20:58:01] hashar: yeah that’s thanks to kafka, and working to increase that percentage [20:58:03] hm [20:58:11] and apparently we dont receive any since 20:34 [20:58:16] hashar: that might also just be indexing lag [20:58:17] ( I use https://logstash.wikimedia.org/app/kibana#/dashboard/api-feature-usage-top-users?_g=h@a4b4479&_a=h@092bdd1 ) [20:58:29] they are log messages emitted for deprecated mediawiki API calls [20:58:58] and we have bots literally hammering the api. So that is a steady/constant source of messages ;) [20:59:24] so it seems still broken somehow? [20:59:41] (03PS3) 10Esanders: VE: Enable true section editing for mobile on labswiki & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) [20:59:55] hashar: it’s catching up if you keep refreshing [21:00:23] these api-feature-usage logs are using kafka already and just need to catch up [21:00:36] ah [21:00:44] you can tell in the tags field, there is input-kafka-rsyslog-udp-localhost. the important part being input-kafka [21:00:47] no matter what is happening, I trust you ;)) [21:00:59] I have long lost track of changes that happened on the logging front! 
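The remedy rediscovered here boils down to moving the wedged persisted-queue directory aside (keeping it for inspection) and letting the service recreate an empty one on restart. A minimal Python sketch of just that directory rotation, under the assumption that the service is stopped around it; `rotate_queue` is our own name, not an existing tool, and the real paths (`/var/lib/logstash`, `/var/lib/elasticsearch/queue`) are as quoted in the log above:

```python
import shutil
import time
from pathlib import Path

def rotate_queue(qdir):
    """Move a wedged queue directory to <name>.bad.<epoch> and recreate
    an empty directory for the service to repopulate after restart."""
    qdir = Path(qdir)
    backup = qdir.with_name(f"{qdir.name}.bad.{int(time.time())}")
    shutil.move(str(qdir), str(backup))  # preserve for later inspection
    qdir.mkdir()                          # service refills this on start
    return backup
```

On a collector this would be wrapped with service control (stop logstash, rotate, start logstash), which is deliberately left out of the sketch.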
[21:01:26] haha [21:01:50] you know [21:01:55] I still have punch card at home [21:02:33] ((( meanwhile train looks fine ))) [21:02:37] i had 2 weeks of punchcard programming in high school because our teacher thought it was funny [21:02:58] he also would intentionally mention 1 time only 'you should number your cards on the corners' and then pick someone who didnt and knock over their shit [21:03:04] that teacher was an asshole. [21:03:09] well [21:03:14] it was actually a good lesson [21:03:18] * apergos gets their 'old guy on the porch' hat on [21:03:22] oh it was, but doesnt mean he wasnt an asshole! [21:03:29] ;))))) [21:03:30] i knew in advance to do that shit [21:03:34] someone had warned me ;D [21:03:37] I ran programs with punch cards because that's how we had to submit jobs [21:03:42] :-P :-P :-P [21:03:57] one typo and you went to fix it, submit the job again and wait another half hour... [21:03:58] punchard programming, then onto turbo pascal! woooooo [21:03:59] Ariel is the wisest of us all [21:04:11] indeed, everyone must stay off ariels lawn [21:04:13] or at least the oldest curmudgeonist [21:04:24] damn straight [21:05:06] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. 
- https://phabricator.wikimedia.org/T217359 (10mobrovac) [21:05:24] https://www.etsy.com/listing/555343740/vintage-ibm-punch-card-wreath we learned to make this crap out of them too :-P [21:07:09] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3418 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [21:07:59] oh no [21:08:08] oh yes [21:08:12] they hung up again [21:09:17] !log disabling logstash persisted queue [21:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:40] disabled live for now, working on a patch next [21:12:43] mediawiki seems to be saturating logstash [21:13:21] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04888 https://grafana.wikimedia.org/dashboard/db/logstash [21:15:26] I can rollback the train [21:15:54] oh [21:15:57] no, it seems fine [21:16:05] also from https://grafana.wikimedia.org/d/000000102/production-logging?orgId=1&from=1551378582537&to=1551388309667 [21:16:20] we had a bunch of EventBus errors since 19:40 [21:16:30] looking [21:16:38] hm, that is interesting [21:16:45] which might or might not be related to: Synchronized wmf-config/CommonSettings.php: SWAT [[gerrit:492770|Remove legacy eventBus config settings.]] (duration: 00m 53s) [21:17:03] ya that should have been a no-op, and things were functional on mwdebug1002 when i checked [21:17:14] I guess something reports to statsd the number of messages for each mediawiki bucket [21:17:22] (it is black magic to me really) [21:18:13] what is that? [21:18:16] mw logs by channel...
ok logstash [21:18:38] yeah [21:18:39] sorry [21:18:40] Unable to deliver all events: (curl error: 7) Couldn't connect to server [21:19:08] (03PS11) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) [21:19:34] https://logstash.wikimedia.org/goto/df640990bc4171aa955a6978c6479a64 for a search of "EventBus" around 19:40 [21:20:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [21:20:14] thank you my logstash fu is poor [21:20:46] that looks real bad [21:21:07] should we revert the mediawiki config change? [21:21:15] yes [21:21:23] wow [21:21:27] yikes. [21:22:24] hmm https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/492770/ [21:22:26] these are all from jobrunner [21:22:28] it has a Depends-On: [21:22:43] ah yeah on the extension [21:22:58] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/492023/ [21:23:07] which is in 1.33.0-wmf.19 [21:23:14] but not in 1.33.0-wmf.18 :/ [21:23:16] petr was wrong with that depends on anyway, it just depended on a config change [21:23:24] okkk [21:23:30] but, in either case [21:23:36] the .19 is out now no? [21:23:44] so if it was that, the errors would stop? [21:24:02] I have no clue ;D [21:25:14] let's revert. [21:25:16] i dunno what is happening [21:25:27] (03PS1) 10Ottomata: Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 [21:25:52] (03CR) 10Hashar: [C: 03+2] Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 (owner: 10Ottomata) [21:26:48] (03Merged) 10jenkins-bot: Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 (owner: 10Ottomata) [21:27:12] ok hashar shall I sync?
[21:27:20] ottomata: doing so ;) [21:27:22] ok thanks [21:27:24] then hmm [21:27:31] gotta look at something to confirm :/ [21:27:40] the devil is at "something" ;) [21:28:02] !log hashar@deploy1001 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 00m 47s) [21:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:43] OH [21:28:44] hashar: [21:28:55] the logstash you sent was from back in time [21:28:56] not current [21:29:12] it isn't erroring now, is it? [21:29:28] that's why i couldn't find when i was looking [21:29:30] (03CR) 10jenkins-bot: Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 (owner: 10Ottomata) [21:29:34] I have one with timestamp=2019-02-28T23:13:00+00:00 [21:29:54] but @timestamp 2019-02-28T20:13:00 [21:30:21] time is hard :( [21:31:25] when did the train finish everywhere? [21:31:46] 20:13 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.19 [21:31:52] right. [21:32:02] rats, ok then i was totally wrong about that config change, it did depend...not sure how [21:32:11] especially since it worked for the smoke test on mwdebug1002 [21:32:26] must be something in the job submission part I don't understand [21:32:35] I dont know really :/ [21:32:46] for mediawiki.EventBus log bucket I have https://logstash.wikimedia.org/goto/aa1086cca43928785120653a1a100c6f [21:32:59] with bunch of errors from 19:40 till 19:56 [21:33:06] then I guess logstash servers dropped events [21:33:34] the logstash outage began around 20:06 or so [21:33:44] excuse me, about 20:09 [21:34:04] the latest error for this I see is the 20:13 one [21:34:17] then there are new EventBus errors from 20:10 to 20:13 roughly [21:34:51] oh, of course, mwdebug would have had .19 when I tested during swat [21:35:05] but group2 wouldn't have, which I wouldn't have been testing on.
[21:35:12] yeah [21:35:15] hashar: i dunno, in either case, let's leave this reverted until next week. [21:35:20] its almost end of day here [21:35:20] if that was the cause of the issue ;) [21:35:21] man [21:35:21] can see logstash start getting bogged down at 19:40 before the udp loss starts here https://grafana.wikimedia.org/d/000000564/logstash-herron-wip?orgId=1&from=now-3h&to=now [21:35:28] ok i have to figure out what to do now [21:35:31] will file a ticket [21:35:48] and to find out what happend to logstash :/ [21:35:55] oh interesting herron, I wonder if that correlates with a growth in the size of the persistent queue [21:36:33] logstash started having udp packet loss apparently right when wmf.19 reached all wikis :/ [21:37:00] yeah the node queue events graph is pretty interesting [21:37:30] herron: a few interesting things on https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=logstash1007&var-datasource=eqiad%20prometheus%2Fops&var-cluster=logstash [21:37:34] the network TX rate [21:37:38] and at the bottom, 'procs blocked' [21:38:13] that's apparently just number of processes blocked on i/o [21:38:15] interesting [21:38:34] could it be a queue starting to fill up at 19:40 and overflowing at 20:10? [21:38:53] that’s what it looks like to me [21:38:58] and the new wmf.19 being deployed at 20:10 being unrelated? 
[21:39:11] I suspect the mw deploy is unrelated [21:39:24] logstash has failed this way in the past [21:39:45] cdanis: well what’s interesting is the mw error logs correlate with the beginning of the queue spike [21:39:54] https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=4&fullscreen&edit&orgId=1&from=now-3h&to=now [21:39:54] and it processes way more events since 20:50 ( https://grafana.wikimedia.org/d/000000564/logstash-herron-wip?panelId=11&fullscreen&orgId=1&from=now-3h&to=now ) [21:40:16] hm, yeah, that is true herron [21:41:45] and I dont get why the rate of mediawiki logs is twice what we had before ( https://grafana.wikimedia.org/d/000000102/production-logging?orgId=1&from=1551378582537&to=1551388309667&panelId=21&fullscreen ) [21:41:52] maybe the queue is flushing again since 20:50 [21:42:10] i'm not sure where that graph comes from [21:42:17] not sure either :/ [21:42:38] herron: are all of these mw logstash errors in kafka now? [21:42:44] I think it was a sprint releng did years ago so we could get a quick glance at the log spam [21:46:26] think so yeah, but don’t have a great way off hand to check how far logstash is behind kafka-logging.
that would be super useful [21:46:51] herron: y'all should deploy a burrow checker for logstash [21:46:57] it just checks and reports consumer lag [21:47:03] (03PS1) 10Bstorm: wikilabels: stage the postgres roles for virtualizing the database [puppet] - 10https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) [21:47:57] (03CR) 10jerkins-bot: [V: 04-1] wikilabels: stage the postgres roles for virtualizing the database [puppet] - 10https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [21:49:48] (03PS2) 10Bstorm: wikilabels: stage the postgres roles for virtualizing the database [puppet] - 10https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) [21:50:06] ottomata: yeah that sounds perfect [21:51:43] (03PS1) 10Herron: logstash: disable persisted queue [puppet] - 10https://gerrit.wikimedia.org/r/493610 (https://phabricator.wikimedia.org/T200960) [21:51:43] bblack: time to remove "temporary" test? :) https://gerrit.wikimedia.org/r/c/operations/dns/+/161426/1/templates/wikipedia.org [21:51:45] herron: i'm going to try to save some of this event data [21:52:00] from the logstash topic [21:52:04] since the event data was emitted there [21:52:11] alright [21:52:11] you ok if I consume and grep from udp_localhost-err ?
[21:52:18] i'm goign to do it from weblog1001 [21:52:22] and save the data locally there [21:52:26] and then figure out what to do [21:52:43] sure yeah, thanks for the heads up [21:53:00] oh, i think i can't connect from there [21:53:07] i'm going to do it from logstash1010 then [21:53:12] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [21:53:13] unless you know of a better place [21:53:20] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) 05Open→03Resolved Everything has been moved smoothly, thanks! [21:53:39] that’s fine too, but the connection from weblog1001 is timing out? [21:54:34] i think ferm rules ya [21:55:12] oh shouldn't be [21:55:13] dunno why [21:55:58] ok, yeah was wondering the same since a wide range of hosts are producing to it too [21:56:16] that was fast herron [21:56:20] ok i got 54K events [21:56:23] from that [21:56:46] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) p:05Triage→03Normal [21:57:17] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [21:57:17] oh no maybe more... 
[21:57:22] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) [21:57:30] firehose activated [21:57:32] so beside that logstash explosion, I am tempted to say the train went fine [21:57:34] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [21:57:35] 269K [21:57:39] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [21:57:57] but I am assuming logstash manages to process enough even to have a good view of mediawiki health state :/ [21:58:07] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [21:58:16] AFAIK there aren't good indicators for logstash indexing lag [21:58:19] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) [21:59:01] !log disable asw2-a5 <> asw-a link - T217383 [21:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:04] T217383: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 [22:00:45] ottomata: do you have an example of burrow configured for main, etc. to check out? [22:00:47] 10Operations, 10ops-eqiad: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [22:01:42] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10aaron) ChronologyProtector positions should be applie... 
[22:01:58] cdanis: for logstash lag, assuming it inserts a timestamp when it processes an event and the original timestamp is kept; We could potentially do some query and look at the delta [22:02:01] wild guess really :( [22:02:19] i think in puppet somewhere [22:02:55] I can dig around too if it’s not a quick link or whatever [22:03:07] anyway [22:03:29] we have logs since maybe 21:10 UTC. Seems MediaWiki 1.33.0-wmf.19 is behaving properly [22:03:33] so I am closing the train [22:03:56] herron: https://github.com/wikimedia/puppet/blob/fae4f09c6b6215b3c91946120fba930063d15963/hieradata/role/common/kafka/monitoring.yaml [22:03:58] !log MediaWiki 1.33.0-wmf.19 deployed on all wikis # T206673 [22:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:01] T206673: 1.33.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T206673 [22:04:22] ottomata: sweet thanks [22:04:25] oh you have logging in there [22:04:55] (03PS12) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) [22:04:55] yeah… but not alerting? [22:05:08] i think lag metrics should be in prometheus then.. [22:05:13] maybe not alerting [22:05:52] herron: [22:05:52] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=logstash&from=1551369945820&to=1551391545820 [22:05:53] ? [22:06:32] heyyy look at that [22:06:52] bits overflow ?
;D [22:06:56] sweet, just need some alerts then [22:07:06] ya also [22:07:07] https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/profile/manifests/kafka/burrow.pp [22:07:12] but elukey would be more helpful here than me [22:07:15] i'm just looking around [22:07:40] (03PS1) 10Ayounsi: Remove asw2-a5-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/493613 (https://phabricator.wikimedia.org/T217383) [22:08:23] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [22:08:47] thanks it’s super helpful [22:09:47] ottomata: would be nice to figure out whether the Swat change at 19:40 actually required wmf.19 which had not been deployed yet to all wikis [22:09:59] hashar: just made https://phabricator.wikimedia.org/T217385 [22:10:27] hashar: indeed [22:10:31] I was looking at https://grafana.wikimedia.org/d/000000201/eventbus?from=1551369945820&to=1551391545820&orgId=1&panelId=1&fullscreen with "show deployments" ticked [22:10:51] and the # of requests / seconds drop when the mw-config change got deployed [22:11:00] yeah makes sense [22:11:03] but the rate goes up again when wmf.19 got deployed everywhere [22:11:15] as to why it would have caused logstash to explode .. I cant tell [22:11:41] hashar i would assume those are unrelated [22:11:45] well [22:11:51] i guess it would result in that many more logstash errors [22:12:22] during that period the 270K errors were produced to logstash [22:12:58] not that much more, that’d be around the order of +100-200 msgs per second [22:13:01] in the error topic [22:14:02] mobrovac: not sure if you are here [22:14:12] but i am going to replay these failed messages [22:14:23] i think it's the right thing to do.
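The burrow-style lag check being proposed is, per partition, just the broker's log-end offset minus the consumer group's committed offset. A sketch of that arithmetic, assuming the two offset maps have already been fetched elsewhere (by Burrow, a Kafka client, or the Prometheus exporter behind the dashboard linked above):

```python
def consumer_lag(end_offsets, committed):
    """Per-partition and total lag for one consumer group.

    end_offsets / committed: {(topic, partition): offset} maps; a
    partition with no committed offset is treated as fully behind."""
    per_partition = {
        tp: end_offsets[tp] - committed.get(tp, 0)
        for tp in end_offsets
    }
    return per_partition, sum(per_partition.values())
```

Alerting on the total (or the max per partition) crossing a threshold would give the missing "how far is logstash behind kafka-logging" signal.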
[22:14:52] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/14936/" [puppet] - https://gerrit.wikimedia.org/r/493613 (https://phabricator.wikimedia.org/T217383) (owner: Ayounsi)
[22:15:02] (PS2) Ayounsi: Remove asw2-a5-eqiad from monitoring [puppet] - https://gerrit.wikimedia.org/r/493613 (https://phabricator.wikimedia.org/T217383)
[22:15:25] Pchelolo: you gone?
[22:16:12] not yet ottomata
[22:16:15] ah k
[22:16:21] (PS1) Kosta Harlan: GrowthExperiments: Enable help panel for user and user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664)
[22:16:36] it seems you were very right about that config change depending on an eventbus extension patch
[22:16:42] https://phabricator.wikimedia.org/T217385
[22:16:48] not sure how or why
[22:17:06] Pchelolo: do you think i should replay those events? I've captured them.
[22:17:43] ottomata: thank you for the task. I have added a screenshot on it
[22:17:57] thanks
[22:18:08] and I am going to sleep ;)
[22:18:32] thank you hashar
[22:18:34] (also if you can find out why logstash exploded it would be nice ;) )
[22:18:34] sorry for the trouble
[22:18:42] (PS1) Kosta Harlan: GrowthExperiments: Enable help panel for user and user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493616 (https://phabricator.wikimedia.org/T215664)
[22:18:53] besides that, the train for wmf.19 itself went fine!
[22:19:05] ottomata: up to you, if it's easy to do
[22:19:23] Pchelolo: i can script it for sure. it is the right thing to do, right?
[22:19:37] otherwise we're missing e.g. 4277 revision-create events
[22:19:42] (PS2) Kosta Harlan: (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664)
[22:19:43] and lots of job queue stuff?
[22:20:12] the job queue is more important than rev-create
[22:20:32] ya
[22:20:36] ok, then will replay...
[22:23:33] PROBLEM - Check correctness of the icinga configuration on icinga2001 is CRITICAL: Icinga configuration contains errors
[22:23:51] I’ll look at that icinga alert
[22:24:22] Operations, ops-eqiad, Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (ayounsi) a: ayounsi→Cmjohnson
[22:25:19] Operations, ops-eqiad, Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (ayounsi)
[22:25:24] (PS1) Reedy: Swap unsetting $wgSpecialPage for DisabledSpecialPage::getCallback( '' ) [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376)
[22:25:41] jouncebot: now
[22:25:41] No deployments scheduled for the next 1 hour(s) and 34 minute(s)
[22:25:43] jouncebot: next
[22:25:43] In 1 hour(s) and 34 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190301T0000)
[22:25:57] (CR) Reedy: [C: +2] "YOLO (will need testing on mwdebug)" [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:26:38] herron: I pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/493613 ~10min ago, dunno if related
[22:27:05] (Merged) jenkins-bot: Swap unsetting $wgSpecialPage for DisabledSpecialPage::getCallback( '' ) [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:27:45] XioNoX: yeah looks like it is related
[22:27:51] https://www.irccloud.com/pastebin/Mbz884sd/
[22:27:57] hum
[22:28:22] herron: so probably because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/493613/2/hieradata/common/monitoring.yaml
[22:28:53] that switch is gone though
[22:29:14] (I'll re-add the statement for now)
[22:29:30] (CR) jenkins-bot: Swap unsetting $wgSpecialPage for DisabledSpecialPage::getCallback( '' ) [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:29:36] !log replaying events from mediawiki eventbus config outage - T217385
[22:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:40] T217385: EventBus mediawiki outage 2019-02-28 - https://phabricator.wikimedia.org/T217385
[22:29:46] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Disable some translate special pages again T217376 (duration: 00m 47s)
[22:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:50] T217376: PHP fatal error: Class 'PHPlot' not found - https://phabricator.wikimedia.org/T217376
[22:30:11] ok, wonder if these will clear too after running puppet on the affected hosts?
[22:31:08] herron: yeah I think we should remove them from puppet before removing the switch from puppet
[22:31:13] robh: ^
[22:31:15] (PS1) Reedy: Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376)
[22:31:27] (CR) Reedy: [C: +2] Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:31:40] ok, yeah was going to say I can’t reach these hosts haha
[22:31:59] those are decoms right?
[22:32:24] (Merged) jenkins-bot: Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:32:38] worth double checking, but if so, clearing them from puppetdb and re-running puppet on icinga2001 should clear this
[22:33:23] so is this error happening on icinga during puppet runs?
[22:33:35] just not sure of the history, trying to sort the backlog with the bot traffic isn't easy to parse
[22:33:39] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove old translate config (duration: 00m 46s)
[22:33:40] herron: ok, can you try to clear them from puppetdb?
[22:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:53] ah yeah, icinga just recently threw a warning that the config has errors
[22:33:57] the icinga config itself
[22:33:59] yeah
[22:34:02] checking that file now
[22:34:05] and removing decom hosts
[22:34:11] that were in asw2-a5
[22:34:43] ok, removing how?
[22:34:57] well, i was going to just grep the entire puppet repo
[22:34:59] from site.pp ?
[22:35:01] and start removing all references
[22:35:05] since they are no longer connected to anything
[22:35:12] and referencing T208584
[22:35:12] T208584: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584
[22:35:19] which is the generic decom task for eqiad cp systems
[22:35:21] (CR) Catrope: [C: +2] (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664) (owner: Kosta Harlan)
[22:35:25] do you know if they were already deactivated in puppet?
[22:35:40] i'm still trying to find a more accurate ticket
[22:35:45] to log this on, so standby =]
[22:35:54] ie: a task with a per-host entry and checklist
[22:36:04] XioNoX: sure I’ll give that a shot
[22:36:29] (Merged) jenkins-bot: (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664) (owner: Kosta Harlan)
[22:36:35] just whoever does something will have to edit this task
[22:36:39] and check off what you did
[22:36:39] I'd like to avoid having old data in Icinga, but first I'd like to not break Icinga :)
[22:36:50] because everyone just fixing something and not updating the task won't work ;]
[22:37:01] robh: https://phabricator.wikimedia.org/T208586 and https://phabricator.wikimedia.org/T208584
[22:37:09] for the decom
[22:37:16] yeah im on 84
[22:37:18] it lacks a checklist
[22:37:20] so it's not good enough
[22:37:21] https://phabricator.wikimedia.org/T217383 for the switch itself
[22:37:22] im editing it to add it
[22:37:39] i mean, i just repeat over and over that a decom task has to use the checklist
[22:37:43] so if it lacks it, it's not good enough.
[22:37:49] Operations, MediaWiki-Containers: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (hashar) And @thcipriani did https://tools.wmflabs.org/dockerregistry/
[22:40:51] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH)
[22:40:52] ok, https://phabricator.wikimedia.org/T208584 is updated
[22:41:04] so im going to just run the decom script on all of those, which will remove from puppet and icinga monitoring
[22:41:15] (CR) jenkins-bot: Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:41:17] (CR) jenkins-bot: (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664) (owner: Kosta Harlan)
[22:41:19] if anyone wants to do another step, they certainly can, just use checkboxes =]
[22:41:26] cleaning puppetdb looks to do the trick for one host, so proceeding with the rest in that error msg
[22:41:44] puppet takes forever and a day to run on icinga, so it will be a few, but that alert should clear shortly
[22:41:52] herron: are you using the script to do that?
[22:41:55] if not, you can stop
[22:41:59] i'll run the script on all of them
[22:42:01] and it removes it
[22:42:13] ie: we try to not do manual steps if we can help it
[22:42:27] but the script should handle it being manually removed; if not, it'll need patching
[22:42:35] " - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)"
[22:43:10] * robh isn't doing anything for fear of two people doing the same thing breaking something
[22:43:19] i'll wait until herron lets me know what's up
[22:43:46] oh ok, yeah go for it. I deactivated the nodes from the error above but submitting the command again isn’t harmful
[22:44:18] Operations, ops-eqiad, Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (ayounsi)
[22:46:00] i'll be back in a bit to check on the replay
[22:47:25] (CR) Ayounsi: [C: +1] "LGTM, not sure what's better between using the subfolder or common.yaml" [labs/private] - https://gerrit.wikimedia.org/r/493084 (owner: CRusnov)
[22:48:18] RECOVERY - Check correctness of the icinga configuration on icinga2001 is OK: Icinga configuration is correct
[22:48:27] herron: https://gerrit.wikimedia.org/r/c/operations/puppet/+/480537 <3
[22:48:51] haha yes! one small step
[22:49:23] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH)
[22:50:03] (PS15) Eevans: Initial configuration for session storage service [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883)
[22:51:11] herron: from puppet, we can probably list all the servers that have IPv6 with autoconfiguration, and apply that statement to them as well, so they have a proper v6
[22:53:26] it makes me so nervous to let puppet perform sweeping network interface changes on live systems
[22:53:33] robh: ^ in case you didn't see herron's answer a few lines above
[22:53:54] ?
[22:54:00] i saw that it's ok to double-run commands?
[22:54:03] or did i miss something else?
[22:54:14] robh: "oh ok, yeah go for it."
[22:54:22] yeah, no worries
[22:54:33] as soon as the icinga error cleared i put it on the back burner and updated the task for every cp on that task
[22:54:48] going to just start working through it systematically since my understanding is icinga is ok again?
[22:55:47] robh: yeah the error recovered
[22:55:52] =]
[22:55:54] yep icinga is happy, thanks robh
[22:56:51] thanks you two!
[22:57:09] i did nothing ;D
[22:57:21] but you did it so well!
[22:57:44] robh: once those servers are decom, it will unblock https://phabricator.wikimedia.org/T208734 too
[23:02:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:03:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:03:28] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational
[23:04:00] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga2001 is CRITICAL: 59.17 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:04:24] what
[23:06:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:07:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:07:51] bblack: ^ looking into that, so far no idea of what's going on, network looks fine
[23:07:52] !log decom cp1045-cp1055, all are role spare but may icinga alert for ping
[23:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:18] things seem to be back to normal
[23:08:38] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga2001 is OK: (C)60 le (W)70 le 71.24 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:13:54] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1045.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:06] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1046.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:12] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (jcrespo) @Marostegui I've chosen not to reimage the server because this is right now a backup testing one, I think it is ok if it currently doesn't have the right enwiki data....
[23:14:23] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1047.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:35] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1048.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:46] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1049.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:03] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1050.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:16] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1051.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:29] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1052.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:40] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1053.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:53] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1054.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:16:06] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1055.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:19:06] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH) Please note cp1045-cp1055 are all on asw-c-eqiad as their active switch, but ports were also reserved on asw2-c-eqiad for migration (if they were not decommissioned befor...
[23:22:14] so much spam
[23:22:15] heh
[23:22:22] that's what happens when scripts log per host ;D
[23:22:26] XioNoX: were the caches ever repooled from the row A work?
[23:22:52] bblack: ah no!
[23:23:04] is that the cause?
[23:23:13] not that that should cause any fallout, but just noticed while verifying some things
[23:23:37] I forgot they need to be manually repooled
[23:24:02] doing it now
[23:24:54] well, verifying everything looks ok, then doing it now :)
[23:25:34] bblack: how do you repool? https://wikitech.wikimedia.org/wiki/Service_restarts#Cache_proxies_(varnish)_(cp) ?
[23:27:16] "pool-once" should probably go away, it has proven to be a bad idea :)
[23:27:23] !log bblack@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp107[678]\.eqiad\.wmnet
[23:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:44] I ran: confctl select 'name=cp107[678]\.eqiad\.wmnet' set/pooled=yes
[23:27:49] on puppetmaster1001 as root
[23:28:03] (it autologs as shown above)
[23:28:15] ah etcd!
[23:28:16] (PS1) RobH: decommission old eqiad cache entries [puppet] - https://gerrit.wikimedia.org/r/493626 (https://phabricator.wikimedia.org/T208584)
[23:29:23] Re: the availability spike earlier, it does get confirmed on other graphs as a varnish 503 spike
[23:29:49] (CR) RobH: [C: +2] decommission old eqiad cache entries [puppet] - https://gerrit.wikimedia.org/r/493626 (https://phabricator.wikimedia.org/T208584) (owner: RobH)
[23:30:08] all sites, cache_text
[23:30:15] so possibly a site<->site connectivity issue?
[23:30:33] ah, only text
[23:30:34] or even an internal problem in eqiad, it affected direct eqiad requests too
[23:31:30] !log pre-configure asw-a1 ports on asw2-a1-eqiad - T187960
[23:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:34] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[23:31:34] so with what little I know so far: could be an issue with any of the services way behind cache_text, or lvs1006/16 (which sits between varnish and most of those services), or network stuff
[23:33:13] ok
[23:34:10] (CR) Bstorm: "The two repos are not 100% identical. We can always revert, though :-D" [puppet] - https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: BryanDavis)
[23:34:30] it looks a lot like just a varnish@eqiad problem though, so far
[23:34:41] (PS1) RobH: decom cp10[45-55] production dns entries [dns] - https://gerrit.wikimedia.org/r/493627 (https://phabricator.wikimedia.org/T208584)
[23:34:48] the same kind of thing we've seen with a misbehaving backend, although again network things could look similar
[23:36:25] (CR) RobH: [C: +2] decom cp10[45-55] production dns entries [dns] - https://gerrit.wikimedia.org/r/493627 (https://phabricator.wikimedia.org/T208584) (owner: RobH)
[23:36:42] Went through the network logs and can't see anything out of the ordinary, but hard to tell, yeah
[23:36:57] yeah seems like cp1075 is at fault
[23:38:53] (PS4) Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T215684)
[23:39:00] (PS5) Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T215684)
[23:39:35] bblack: saw that in cp1075's logs?
[23:40:10] in the weblog1001 5xx logs
[23:40:23] ok
[23:40:36] I'm trying to correlate anything strange happening on cp1075 itself so we don't just end up saying "varnish misbehaves sometimes" again :P
[23:41:49] (PS6) Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T217276)
[23:41:55] but really, there's not much that correlates
[23:42:09] (PS1) Smalyshev: Run WikibaseCirrusSearch code for search on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/493629 (https://phabricator.wikimedia.org/T217276)
[23:42:15] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH)
[23:43:36] would be useful to have a checklist or list of links to check when this issue happens
[23:43:45] everything!
[23:44:01] even if it's "try to find a pattern in weblog1001's log"
[23:44:10] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1
[23:44:37] ^ I tend to look here often/first, and select status code type "5" and view the bottom graph and drill around sites/caches
[23:44:46] yeah that one is useful!
[23:44:58] !log pre-configure asw-a2 ports on asw2-a2-eqiad - T187960
[23:44:59] it's similar to what the linked nginx availability graph shows, it's just older and I trust it more, but it's also clunkier
[23:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:02] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[23:46:14] given the time range was very close on ~23:00, on weblog1001 this is what gave up the correlation to cp1075 specifically (anything else, like lvs or applayer, would've been more spread around the caches, in most cases):
[23:46:20] bblack@weblog1001:/srv/log/webrequest$ grep 2019-02-28T23:00 5xx.json |grep -w '"503"'|jq .x_cache|cut -d, -f1|sort|uniq -c|sort -rn|head -10
[23:46:23] 7985 "cp1075 int
[23:46:25] 2 "cp1085 int
[23:46:28] 1 "cp1089 int
[23:46:38] but I got there in steps, obviously
[23:47:28] the initial grep will give you the lines from 5xx.json that happened during that minute, then grepping (somewhat imprecisely, again) for 503s, then jq pulls out the "X-Cache" header, and cut takes the first entry in the comma-separated list...
[23:48:08] I probably piped the output of most of those things into a final "| head" as I was building up, just to see the data as it pares down and transforms
[23:48:40] and then once I liked it after cut, I tacked on " |sort|uniq -c|sort -rn|head -10" which will glomp them all up by hitcount and give you the top 10 (in this case, there were only 3)
[23:48:42] makes sense
[23:49:06] there's probably a vastly simpler way that doesn't involve 4 commands heh
[23:49:35] but I've learned some dumb finger-memory patterns, and that one decomposes well if you want to step back a little in it
[23:49:58] yeah, writing that down for future use
[23:51:41] bblack: unrelated, ok to replace the pool-once with the etcd depool in https://wikitech.wikimedia.org/wiki/Service_restarts#Cache_proxies_(varnish)_(cp) ?
[23:52:26] well that whole section is slightly confusing, but yes, documentation of pool-once should probably go away
[23:52:51] what still works and we still like: they will auto-depool themselves on clean shutdown, and you can always depool or pool as root on a running machine by typing "pool" or "depool" on it.
[23:53:25] the confctl command was just easier for me than sshing to 3 different machines to run "pool", since I was already looking in confctl just to confirm what was already in what state
[23:54:35] the pool/depool commands are not documented either, will add
[23:54:48] the pool-once saga is that we figured since we auto-depool on shutdown, maybe it would make life easier to be able to auto-repool when it comes back for simple cases. But we knew it should be explicit, so if you touch that pool-once file, when it powers up it will self-repool and delete the file.
[23:55:13] but it turns out that's a horrible idea, because when you decide to poweroff/reboot a machine, you can't predict the future and don't know if it will ever be back
[23:55:31] so we've had cases where we thought it was a quick reboot, but the hardware is borked and it never comes back.
[23:56:11] then we open a hardware ticket, and some hardware part gets replaced, and maybe a week or three later someone finally boots it back up. But having been offline a few weeks, it's definitely not in a state where we want it self-repooling, which is what we told it to do when we shut it down for the "quick" reboot.
[23:56:26] because it's missing all kinds of updates and changes and ocsp state, etc. during the long downtime.
[23:57:26] so really that whole pool-once mechanism should be killed, just haven't gotten around to it
[23:57:37] yeah, self-repool should at least do a self-check first
[23:58:03] the problem is there's no great way to structure that. when it first boots, who knows what it might be missing that happened during the time it was away.
[23:58:25] maybe someone applied an extremely critical package update for a security bugfix to all the caches via cumin but this one missed it, or whatever.
[23:59:18] or if down for > X hours then don't self-repool, and alert on icinga that a cp server is depooled
[23:59:27] it's kind of ok to auto-pool around a quick reboot, but in the general case it's probably better to just reinstall if it's been out of the loop for a long time.
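For the record, bblack's four-command weblog1001 pipeline above (grep one minute of 5xx.json, keep the 503s, pull X-Cache with jq, cut the first hop, then sort | uniq -c | sort -rn | head) decomposes into a few lines of Python. This is only a sketch: the `dt`, `http_status`, and `x_cache` field names are assumptions standing in for whatever keys the real 5xx.json records carry.

```python
import json
from collections import Counter

def top_503_caches(log_lines, minute_prefix="2019-02-28T23:00", n=10):
    """Rough Python equivalent of:
    grep <minute> 5xx.json | grep -w '"503"' | jq .x_cache | cut -d, -f1
                           | sort | uniq -c | sort -rn | head -10"""
    counts = Counter()
    for line in log_lines:
        if minute_prefix not in line:  # the "grep <minute>" step
            continue
        rec = json.loads(line)
        if rec.get("http_status") != "503":  # the grep -w '"503"' step
            continue
        # X-Cache lists cache hops comma-separated; keep only the first
        # (backend-most) entry, like cut -d, -f1
        counts[rec["x_cache"].split(",")[0]] += 1
    return counts.most_common(n)  # sort | uniq -c | sort -rn | head

# Hypothetical sample records; real 5xx.json field names may differ.
sample = [
    '{"dt": "2019-02-28T23:00:01", "http_status": "503", "x_cache": "cp1075 int, cp1089 miss"}',
    '{"dt": "2019-02-28T23:00:02", "http_status": "503", "x_cache": "cp1075 int, cp1085 miss"}',
    '{"dt": "2019-02-28T23:00:03", "http_status": "200", "x_cache": "cp1085 hit, cp1089 hit"}',
]
print(top_503_caches(sample))  # [('cp1075 int', 2)]
```

Like the shell version, it groups 503s by the first X-Cache hop, which is the step that singled out cp1075 in the incident above.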