[00:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T0000). [00:00:05] Smalyshev and ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:20] I also have a patch https://gerrit.wikimedia.org/r/490326 [00:01:10] i can ship things [00:01:22] Zppix|mobile: will you be able to test from mobile? [00:01:32] * ebernhardson isn't sure how to test that, but trusts the +1's you have [00:01:34] I have my laptop in front of me ebernhardson [00:01:40] Zppix|mobile: alright perfect [00:01:57] Zppix|mobile: go ahead and add to the Deployments page please [00:02:15] ebernhardson ok [00:03:31] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493320 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [00:04:35] (03Merged) 10jenkins-bot: Add config for switching Wikibase search to WikibaseCirrusSearch codebase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493320 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [00:05:07] SMalyshev: I've pulled your patch to mwdebug1001, although afaict it's a noop outside of beta [00:05:27] ebernhardson: it's a noop everywhere actually [00:05:38] it just defines the variables, but doesn't do anything with them yet [00:05:46] ok excellent [00:06:02] there's a followup patch for it, but I'll do it tomorrow... 
I don't want to enable stuff and then go away [00:06:22] so if this one doesn't break the wiki, it's ok [00:06:30] sounds good, syncing [00:06:40] (03PS3) 10EBernhardson: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:06:47] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:07:18] !log ebernhardson@deploy1001 Synchronized wmf-config/: T215684 Add config for switching Wikibase search to WikibaseCirrusSearch codebase (duration: 00m 55s) [00:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:22] T215684: Deploy & test WikibaseCirrusSearch on beta cluster - https://phabricator.wikimedia.org/T215684 [00:07:22] (03CR) 10jenkins-bot: Add config for switching Wikibase search to WikibaseCirrusSearch codebase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493320 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [00:07:54] (03Merged) 10jenkins-bot: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:08:06] (03CR) 10jenkins-bot: Remove mediawikiwiki from wgCentralAuthAutoCreateWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490326 (owner: 10Zppix) [00:08:26] Zppix|mobile: you're up on mwdebug1001 [00:08:34] Testing... 
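For context, "you're up on mwdebug1001" means the change is staged on a debug host and the tester should route their own requests through it before the config is synced fleet-wide. A minimal sketch of such a check, assuming the X-Wikimedia-Debug header format documented on Wikitech (the backend hostname comes from the log above; the block only prints the command rather than executing it):

```shell
# Sketch: verify a staged config change on a debug host before full sync.
# Assumes the documented "X-Wikimedia-Debug: backend=<host>" header format;
# hostname taken from the log above, URL chosen for illustration.
DEBUG_BACKEND='mwdebug1001.eqiad.wmnet'
URL='https://test.wikipedia.org/wiki/Special:Version'

# Build the curl invocation that pins the request to the debug backend
# (printed, not run, since this is only a sketch).
CMD="curl -s -H 'X-Wikimedia-Debug: backend=${DEBUG_BACKEND}' '${URL}'"
echo "$CMD"
```

A tester would run the printed command (or set the header via a browser extension) and confirm the page renders without errors before the deployer syncs.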
[00:09:12] No errors on my end good from here ebernhardson [00:10:34] !log ebernhardson@deploy1001 Synchronized wmf-config/CommonSettings.php: T215725 Remove mediawikiwiki from wgCentralAuthAutoCreateWikis (duration: 00m 54s) [00:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:41] Zppix|mobile: synced ^ [00:10:41] T215725: Consider removing mediawikiwiki from wgCentralAuthAutoCreateWikis - https://phabricator.wikimedia.org/T215725 [00:10:47] Thanks [00:19:43] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:33] !log ebernhardson@deploy1001 Synchronized vendor/: Remove scalar type hints from ruflin/Elastica (duration: 00m 58s) [00:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:40] SWAT is complete [00:36:37] should have checked closer, one more patch to finish off the last sync [00:47:19] PROBLEM - Device not healthy -SMART- on db2033 is CRITICAL: cluster=mysql device=cciss,11 instance=db2033:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2033&var-datasource=codfw+prometheus/ops [00:49:14] (03PS1) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 [00:50:07] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (owner: 10CRusnov) [00:50:59] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:55:20] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.19/vendor/: vendor/ruflin/Elastica: Remove scalar return type hints (duration: 01m 33s) [00:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:36] (03PS2) 10CRusnov: Add configuration for the ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493348 [00:59:06] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (owner: 10CRusnov) [00:59:18] (03PS3) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [01:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T0100). [01:00:41] (03CR) 10jerkins-bot: [V: 04-1] Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [01:03:41] !log preparing to deploy phabricator-2019-02-27 [01:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:45] (03PS1) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 [01:10:37] (03CR) 10jerkins-bot: [V: 04-1] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (owner: 10CRusnov) [01:10:53] (03PS2) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) [01:11:05] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is CRITICAL: 133.9 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [01:11:41] (03CR) 10jerkins-bot: [V: 04-1] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [01:13:19] (03PS3) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 
10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) [01:16:19] (03PS4) 10CRusnov: Add configuration for the ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493348 (https://phabricator.wikimedia.org/T215229) [01:19:22] RECOVERY - nova-compute proc maximum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [01:20:41] !log deploying phabricator update 2019-02-27 [01:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:45] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) I wonder if --target-max-inflight-requests... [01:25:34] PROBLEM - nova-compute proc maximum on cloudvirt1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [01:43:03] !log phabricator upgrade completed without issues (actually completed at 01:23 UTC but I failed to hit enter and submit this message) [01:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:21] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) a:03aaron [02:04:28] * chasemp waves to twentyafterfour - thanks for still keeping it rolling after all this time :) [02:08:48] !log clouddb1002 is now in place to replace labsdb1004 as replica for toolsdb but not wikilabels postgres yet T193264 [02:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:51] T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020 - 
https://phabricator.wikimedia.org/T193264 [02:18:26] RECOVERY - nova-compute proc maximum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [02:24:36] PROBLEM - nova-compute proc maximum on cloudvirt1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [02:26:05] That host says 'notifications for this host have been disabled' [02:26:06] and yet... [02:48:33] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10kaldari) @MoritzMuehlenhoff - Now that the Thumbor hosts are upgraded to Debian Stretch, and Cargo has been made available in Stretch, are there any remaining blockers to upgrading li... [02:54:05] RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 28.41 seconds [03:00:04] kart_: I, the Bot under the Fountain, allow thee, The Deployer, to do deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T0300). 
[03:03:51] !log Manual run of unpublished ContentTranslation draft purge script (T216983) [03:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:55] T216983: Run unpublished draft purge script for CX (Week of 24/02) - https://phabricator.wikimedia.org/T216983 [05:49:57] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T217301 (10Marostegui) p:05Triage→03Normal a:03Papaul @Papaul let's get the disk replaced. Thank you [05:56:04] !log Upgrade MySQL on db1124 (Sanitarium) lag will be generated on s1,s3,s5,s8 [05:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:29] (03PS3) 10Elukey: Add labsdb1012 basic puppet settings [puppet] - 10https://gerrit.wikimedia.org/r/493299 (https://phabricator.wikimedia.org/T215231) [05:59:42] marostegui: ---^ \o/ [06:00:29] (03CR) 10Marostegui: [C: 03+1] Add labsdb1012 basic puppet settings [puppet] - 10https://gerrit.wikimedia.org/r/493299 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [06:00:52] (03CR) 10Elukey: [C: 03+2] Add labsdb1012 basic puppet settings [puppet] - 10https://gerrit.wikimedia.org/r/493299 (https://phabricator.wikimedia.org/T215231) (owner: 10Elukey) [06:01:34] marostegui: It seems my script is taking more time than scheduled. It should finish in 15 more minutes, but let me know if anything is wrong. 
Running on mwmaint1002 (https://phabricator.wikimedia.org/T216983) [06:02:30] kart_: I had no idea there was that script running :) [06:02:56] marostegui: just in case someone is wondering :) [06:03:14] :) [06:13:27] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) [06:18:09] !log Finished manual run of unpublished ContentTranslation draft purge script (T216983) [06:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:13] T216983: Run unpublished draft purge script for CX (Week of 24/02) - https://phabricator.wikimedia.org/T216983 [06:18:46] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10Mholloway) A few more notes, for posterity: After digging some mor... [06:18:57] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 [06:21:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 (owner: 10Marostegui) [06:22:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 (owner: 10Marostegui) [06:23:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 (duration: 00m 55s) [06:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:28] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2033 is CRITICAL: cluster=mysql device=cciss,11 instance=db2033:9100 job=node site=codfw Marostegui T217301 - The acknowledgement expires at: 2019-03-08 06:24:09. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2033&var-datasource=codfw+prometheus/ops [06:27:19] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [06:29:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493361 (owner: 10Marostegui) [06:31:21] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/nginx] [06:45:39] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) @Cmjohnson DHCP works fine and I can PXE boot, but then the Debian installer complains about "no partition found".. I checked via the... [06:46:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The content of the files is correct, but I don't think the code belongs in the profile but rather the module as there is nothing wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [06:49:20] RECOVERY - nova-compute proc minimum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [06:50:00] !log Deploy schema change on db1079, this will generate lag on s7 on labs - T86342 [06:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:04] T86342: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 [06:57:27] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:02:23] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (10Gilles) Yes, you can go ahead 
and switch Thumbor to Mcrouter. Thumbor uses memcached for non-critical... [07:08:04] !log Stop MySQL on db1079 for mysql upgrade [07:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:40] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) My 2 cents: * from [[ https://grafana.wi... [07:21:00] (03PS1) 10Marostegui: db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 [07:22:04] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 (owner: 10Marostegui) [07:23:08] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 (owner: 10Marostegui) [07:24:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1079 after mysql upgrade (duration: 00m 56s) [07:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:20] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1079 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493362 (owner: 10Marostegui) [07:30:16] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) There is one thing that I don't get from... 
[07:41:26] (03PS1) 10Marostegui: db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 [07:44:53] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 (owner: 10Marostegui) [07:46:17] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 (owner: 10Marostegui) [07:47:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1079 in API after mysql upgrade (duration: 00m 53s) [07:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] (03CR) 10jenkins-bot: db-eqiad.php: Slowly pool db1079 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493363 (owner: 10Marostegui) [08:04:11] (03PS2) 10Elukey: hadoop analytics: move Yarn rmstore from zookeeper to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/493164 (https://phabricator.wikimedia.org/T216952) [08:04:57] (03CR) 10Elukey: [C: 03+2] hadoop analytics: move Yarn rmstore from zookeeper to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/493164 (https://phabricator.wikimedia.org/T216952) (owner: 10Elukey) [08:06:47] !log installing glibc security updates for stretch [08:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:27] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 [08:09:27] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 (owner: 10Marostegui) [08:10:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 (owner: 10Marostegui) [08:12:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1079 after mysql upgrade (duration: 00m 
54s) [08:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:44] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493364 (owner: 10Marostegui) [08:16:50] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 [08:22:20] !log Change abuse_filter_log indexes on s3 codfw, lag will appear on codfw - T187295 [08:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:23] T187295: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 [08:28:31] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10MoritzMuehlenhoff) @kaldari Effie looked into that and my initial estimation that we could build librsvg 2.42 in Stretch didn't hold, it needs more recent versions of Rust and Cargo t... [08:29:17] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 (owner: 10Marostegui) [08:30:22] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 (owner: 10Marostegui) [08:31:37] !log roll restart of Yarn Resource Managers on an-master100[1,2] to pick up new settings [08:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase API traffic db1079 after mysql upgrade (duration: 00m 53s) [08:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:49] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1079 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493365 (owner: 10Marostegui) [08:38:36] (03CR) 10DCausse: "all plugins except hebrew and ltr have been officially 
released." (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) (owner: 10DCausse) [08:52:24] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 [08:57:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 (owner: 10Marostegui) [08:58:12] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 (owner: 10Marostegui) [08:59:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1079 (duration: 00m 53s) [08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:45] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493367 (owner: 10Marostegui) [09:02:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 [09:06:08] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 (owner: 10Marostegui) [09:07:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 (owner: 10Marostegui) [09:08:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 (duration: 00m 53s) [09:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:25] !log Stop MySQL on db1121 for upgrade, this will generate lag on labsdb:s4 [09:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:19] (03PS5) 10DCausse: Plugins for elasticsearch 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/491297 (https://phabricator.wikimedia.org/T199791) [09:12:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Add federation 
configs for commonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493339 (https://phabricator.wikimedia.org/T217285) (owner: 10Ladsgroup) [09:14:35] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493369 (owner: 10Marostegui) [09:17:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 [09:19:00] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 (owner: 10Marostegui) [09:19:57] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 (owner: 10Marostegui) [09:21:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 00m 54s) [09:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:27] (03PS8) 10Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 [09:22:31] !log Stop MySQL on db1125 (sanitarium) to upgrade, this will generate lag on labs on: s2, s4, s6,s7 [09:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Move hhvm-fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [09:26:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493371 (owner: 10Marostegui) [09:26:33] !log installed php security updates on netmon1002 and people1001 [09:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:43] (03PS14) 10Giuseppe Lavagetto: mediawiki: Move hhvm-fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 
10Krinkle) [09:30:42] !log start cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952 [09:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:46] T216952: Hadoop Yarn stores a ton of znodes related to running/old applications - https://phabricator.wikimedia.org/T216952 [09:34:04] (03CR) 10Filippo Giunchedi: "LGTM, let us know when safe to deploy" [puppet] - 10https://gerrit.wikimedia.org/r/493323 (https://phabricator.wikimedia.org/T217162) (owner: 10Anomie) [09:37:22] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/NavigationTiming/modules/ext.navigationTiming.js: T217210 Don't assume PerformanceObserver entry types are supported (duration: 00m 54s) [09:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:25] T217210: Nav timing throws exception on Safari "TypeError: entryTypes contained only unsupported types" - https://phabricator.wikimedia.org/T217210 [09:43:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Move hhvm-fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [09:44:45] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/14918/" [puppet] - 10https://gerrit.wikimedia.org/r/493232 (https://phabricator.wikimedia.org/T86969) (owner: 10Filippo Giunchedi) [09:44:53] (03PS2) 10Filippo Giunchedi: deployment_server: ship logs through logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/493232 (https://phabricator.wikimedia.org/T86969) [09:45:20] <_joe_> godog: you'll have to wait for puppet-merging [09:45:39] <_joe_> i need to disable puppet and coordinate runs wherever hhvm is installed [09:45:43] <_joe_> thanks, puppet [09:46:47] _joe_: ok, no problem, thanks for the heads up [09:47:13] <_joe_> I'll unblock the deployment servers in a minute [09:48:53] kk, let me know when done [09:50:23] <_joe_> {{done}} 
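The db1079 commits logged above follow a fixed maintenance pattern: depool, upgrade MySQL, repool at low weight, raise the weight in steps (general and API groups separately), then restore full weight. Each step is just a weight edit in wmf-config/db-eqiad.php. A hypothetical fragment illustrating one intermediate step (db1079 and the s7 section come from the log; the array shape, the other host names, and all weights are assumptions for the sketch, not the real file contents):

```php
<?php
// Illustrative only: one staged-repool step for db1079 in s7, sketched
// against an assumed db-eqiad.php layout. Host names other than db1079
// and all weight values are hypothetical.
$sectionLoads['s7'] = [
    'db1062' => 0,   // hypothetical master: takes no general read load
    'db1079' => 50,  // repooled at low weight after the MySQL upgrade
    'db1086' => 300, // hypothetical untouched replica at full weight
];
// Follow-up commits raise db1079 step by step (e.g. 50 -> 150 -> 300),
// each one synced with scap and watched for lag before the next bump.
```

The point of the staging is that a freshly restarted replica has a cold buffer pool; ramping the weight lets its caches warm up without sending it a full share of production reads at once.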
[09:51:19] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:51:46] <_joe_> that's me, disregard [09:56:31] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:00:26] <_joe_> !log executing a rolling puppet run (2 servers at a time per cluster, per dc) in eqiad,codfw as an HHVM restart will be triggered [10:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:30] (03PS1) 10Zoranzoki21: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) [10:05:51] (03PS1) 10ArielGlenn: update dashboard name so links to prometheus metric graphs work [software/tendril] - 10https://gerrit.wikimedia.org/r/493376 [10:07:05] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) [10:07:40] (03PS2) 10Zoranzoki21: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) [10:07:46] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:08:30] (03CR) 10jerkins-bot: [V: 04-1] Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [10:09:04] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) [10:10:21] (03CR) 10Zoranzoki21: "Scheduled for today Morning SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [10:14:07] 
10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [10:14:12] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) 05Open→03Stalled [10:17:53] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: HTTP CRITICAL - No data received from host [10:18:27] mmhm etherpad down [10:18:44] it was updated recently [10:19:07] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8961 bytes in 0.082 second response time [10:19:15] jijiki: how recently? [10:19:23] volans: I think last week [10:20:11] it is back anyway, wikimedia runs on coffee and etherpad :p [10:38:53] (03PS1) 10Urbanecm: New throttle rule for Czech Wikigap 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) [10:46:03] (03PS1) 10Urbanecm: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) [10:47:40] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/493377 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:48:32] (03CR) 10Elukey: "I have a couple of questions:" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [10:52:19] PROBLEM - HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:53:13] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 76125 bytes in 0.463 second response time [10:53:31] RECOVERY - mediawiki-installation DSH group on mw1272 is OK: OK [11:00:40] (03PS1) 10Ammarpad: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) [11:07:22] (03PS2) 10Ammarpad: Set default aliases 
for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) [11:11:11] (03CR) 10Addshore: [C: 04-2] "Federation is not and should not currently be turned on on commonswiki (yet)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493339 (https://phabricator.wikimedia.org/T217285) (owner: 10Ladsgroup) [11:12:09] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is OK: (C)130 ge (W)110 ge 105.5 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [11:28:33] !log pause cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952 [11:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] T216952: Hadoop Yarn stores a ton of znodes related to running/old applications - https://phabricator.wikimedia.org/T216952 [11:30:59] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:19] PROBLEM - Host sca1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:27] PROBLEM - Host sca2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:35] expected ^ [11:31:38] actually [11:31:43] Let's celebrate! [11:32:01] LOL [11:32:05] !log remove sca1003, sca1004, sca2003, sca2004 from the fleet. Celebrate!!!! 
[11:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] PROBLEM - Host sca2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:32:45] puppet should run soon on icinga host so these are going away [11:35:38] <_joe_> akosiaris: \o/ [11:37:16] \o/ [11:53:13] (03PS4) 10WMDE-Fisch: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) [11:53:40] (03PS1) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to access the admin port [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) [11:55:41] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:37] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 76098 bytes in 0.268 second response time [11:59:43] !log rolling openssl security updates to jessie systems [11:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1200). [12:00:04] CFisch_WMDE and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:11] Here [12:00:13] \o/ [12:00:49] o/ [12:00:57] CFisch_WMDE: are you a deployer? [12:01:06] if not, why not!? ;) [12:01:34] Yes, but again in a meeting in parallel so not really able to do it myself -.- [12:01:49] no problemo, in that case [12:01:54] :-) [12:01:54] I can SWAT today! [12:03:20] CFisch_WMDE: I'll ping you when your patch is ready for testing at mwdebug [12:03:29] Cool! 
[12:05:57] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [12:07:07] (03Merged) 10jenkins-bot: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [12:07:17] (03Abandoned) 10Ladsgroup: Add federation configs for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493339 (https://phabricator.wikimedia.org/T217285) (owner: 10Ladsgroup) [12:07:52] CFisch_WMDE: it's at mwdebug1002, please test and let me know if I can deploy it [12:08:01] Urbanecm: please stand by, you're next :) [12:08:07] ok [12:08:14] only throttles this time, so... :) [12:08:28] Urbanecm: ah, in that case I'll just ping you when I'm done :) [12:08:32] good [12:08:51] * CFisch_WMDE testing [12:11:32] zeljkof: hmm it's not really working [12:11:43] mwdebug1002 is correct? [12:12:06] CFisch_WMDE: yes, mwdebug1002 [12:13:01] CFisch_WMDE: revert? [12:13:09] No give me a sec. [12:13:25] ok [12:15:04] * addshore reads up [12:15:19] So there's no unintended behavior visible anywhere. Maybe it has to do with some strange caching. The change in the config is quite minor; we might be good just deploying it. [12:16:17] addshore: Normally you should see the reference previews as beta feature on test.wikipedia.org [12:16:31] CFisch_WMDE: so, I should deploy? or wait?
[12:16:32] (03CR) 10jenkins-bot: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [12:16:36] And it should be non-beta-featurish on the beta clusters [12:16:45] zeljkof: please deploy it [12:16:52] CFisch_WMDE: ok, I can revert after the deploy, if there are problems [12:16:55] deploying [12:18:12] !log zfilipin@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:491959|Show referencePreviews on group0 wikis as beta feature (T214905)]] (duration: 00m 56s) [12:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:15] T214905: Show referencePreviews on group0 wikis - https://phabricator.wikimedia.org/T214905 [12:18:36] CFisch_WMDE: it's deployed, please test and let me know if I should revert, in case of trouble [12:19:34] zeljkof: Cool, thanks I will check again. [12:19:51] CFisch_WMDE: I'm seeing this in the logs: MWContentSerializationException from line 155 of /srv/mediawiki/php-1.33.0-wmf.18/extensions/Wikibase/lib/includes/Store/EntityContentDataCodec.php: Content too big! Entity: Q27972199 [12:20:12] zeljkof: that is unrelated [12:20:18] zeljkof: there is a ticket for that somewhere [12:20:24] ok, cool, did not notice it before [12:20:47] On beta I had some random errors with settings include something when loading the page [12:20:59] it's T215380 [12:21:00] T215380: Content too big!
Entity: Q27972199 - https://phabricator.wikimedia.org/T215380 [12:21:04] ( but that was already before ) [12:25:14] (03PS1) 10Muehlenhoff: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 [12:26:02] (03CR) 10jerkins-bot: [V: 04-1] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff) [12:26:40] zeljkof: I'll try to get that ticket done in the coming week! [12:27:02] addshore: please do, it's at the top of fatal-monitor and mediawiki-errors! o.O [12:27:05] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "Since this is allowing from a handful of servers it is not such horrible idea. I suggest if we are happy with how this goes, we can add so" [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [12:27:20] even more hits than T204871 [12:27:20] T204871: Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments - https://phabricator.wikimedia.org/T204871 [12:27:34] zeljkof: indeed, it is essentially just a bot trying to edit the entity, and each time they retry the edit it exceptions again [12:27:44] bad bot [12:28:01] well, bad wikibase for bubbling the exception all the way up i guess :p [12:28:08] / for the page size limit [12:28:11] that too ;P [12:28:26] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) (owner: 10Urbanecm) [12:29:28] (03Merged) 10jenkins-bot: New throttle rule for Czech Wikigap 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) (owner: 10Urbanecm) [12:30:14] (03PS2) 10Muehlenhoff: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 [12:30:56] (03CR)
10jerkins-bot: [V: 04-1] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff) [12:31:30] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493382|New throttle rule for Czech Wikigap 2019 (T217270)]] (duration: 00m 53s) [12:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:35] T217270: Add throttle rule for Czech Wikigap 2019 - https://phabricator.wikimedia.org/T217270 [12:32:59] (03PS2) 10Zfilipin: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:33:09] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:33:46] On the beta cluster I sometimes get this error Fatal error: Cannot redeclare wmfLabsSettings() (previously declared in /srv/mediawiki/wmf-config/InitialiseSettings-labs.php:81) in /srv/mediawiki/wmf-config/InitialiseSettings-labs.php on line 81 [12:33:46] (172.16.4.119) [12:34:07] (03Merged) 10jenkins-bot: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:34:23] CFisch_WMDE: revert? 
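The throttle patches synced above (T217270, T217155) follow the usual wmf-config pattern: a temporary account-creation throttle exception keyed by wiki, IP range, and event time window, deployed via wmf-config/throttle.php. A minimal Python sketch of how such an exception list might be evaluated — the entry shape mirrors throttle.php, but the names, range, and limits here are illustrative, not the real rules:

```python
from datetime import datetime, timezone
from ipaddress import ip_address, ip_network

# Hypothetical throttle exceptions, mirroring the shape of entries in
# wmf-config/throttle.php (dbname, IP range, time window, raised limit).
THROTTLE_EXCEPTIONS = [
    {
        "dbname": "cswiki",           # e.g. Czech Wikigap 2019 (T217270)
        "range": "198.51.100.0/24",   # documentation range, not the real one
        "from": datetime(2019, 3, 8, tzinfo=timezone.utc),
        "to": datetime(2019, 3, 9, tzinfo=timezone.utc),
        "value": 40,                  # accounts allowed instead of the default
    },
]

DEFAULT_LIMIT = 6  # illustrative default account creations per IP per day

def creation_limit(dbname, ip, now):
    """Return the account-creation limit that applies to this request."""
    for rule in THROTTLE_EXCEPTIONS:
        if (rule["dbname"] == dbname
                and rule["from"] <= now < rule["to"]
                and ip_address(ip) in ip_network(rule["range"])):
            return rule["value"]
    return DEFAULT_LIMIT
```

During the event window an IP inside the listed range gets the raised limit; everyone else, and everyone outside the window, keeps the default — which is why these rules can be merged ahead of time and simply expire on their own.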
[12:34:40] nope I got this before already [12:35:03] ah and now at least one part of the patch seems to work [12:35:46] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493383|Add throttle rule for Day of Digital Service (T217155)]] (duration: 00m 52s) [12:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:52] T217155: Requesting temporary lift of IP cap - https://phabricator.wikimedia.org/T217155 [12:35:53] Urbanecm: all deployed [12:35:59] thx [12:36:10] CFisch_WMDE: I'm around in case a revert is needed [12:36:15] !log EU SWAT finished [12:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:28] CFisch_WMDE, Urbanecm: thank you for deploying with #releng ;) [12:36:33] yw [12:36:56] zeljkof: addshore Could it be that there's an additional weird caching / propagation for new beta features? [12:37:32] CFisch_WMDE: I don't know how those work :/ [12:39:34] (03CR) 10jenkins-bot: New throttle rule for Czech Wikigap 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493382 (https://phabricator.wikimedia.org/T217270) (owner: 10Urbanecm) [12:39:36] (03CR) 10jenkins-bot: Add throttle rule for Day of Digital Service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493383 (https://phabricator.wikimedia.org/T217155) (owner: 10Urbanecm) [12:41:06] Ok I'll just wait and see .... the part of the patch affecting the beta-cluster is working, maybe the rest will fix itself later. I'm grabbing lunch now. [12:44:44] CFisch_lunch: i cant think of any such cache :/ [12:44:51] and i checked the settings and they look correct [12:45:00] CFisch_lunch: is the expectation that it now appears as a BF? [12:45:04] or was it doing that before? [12:46:05] * zeljkof 's finger is hovering over big red REVERT button ;P [12:48:28] (03CR) 10Muehlenhoff: [C: 03+1] "Sound fine to me, anyone on the deployment hosts can easily do greater damage than what the admin port grants them." 
[puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [12:54:40] CFisch_lunch: zeljkof is it in wgBetaFeaturesWhitelist ? [12:55:43] addshore: sorry, what? [12:55:44] :) [12:57:50] I think the BF hasn't been added to wgBetaFeaturesWhitelist, which is why it won't show :) [12:58:14] ah, I really don't know how that works [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1300) [13:18:08] (03PS2) 10Giuseppe Lavagetto: Scap: upgrade cloud VPS to 3.9.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: 10Thcipriani) [13:18:10] (03PS1) 10Giuseppe Lavagetto: scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) [13:18:48] (03PS3) 10Ammarpad: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) [13:26:38] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) Thanks @Gilles ! I have tried again a 9000h migration and that worked, I've now merged the migrated storage with existing storage and wi...
[13:32:36] (03PS1) 10Joal: Update sqoop timers templates to follow new CLI [puppet] - 10https://gerrit.wikimedia.org/r/493406 (https://phabricator.wikimedia.org/T215290) [13:34:51] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2003.codfw.wmnet [13:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:54] (03PS1) 10Joal: Add ipblocks_restrictions table to labs sqoop list [puppet] - 10https://gerrit.wikimedia.org/r/493407 (https://phabricator.wikimedia.org/T209549) [13:43:37] !log depool prometheus1003.eqiad.wmnet to take a data snapshot [13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:43] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [13:46:49] (03PS2) 10Effie Mouzeli: scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [13:47:09] (03CR) 10Effie Mouzeli: [C: 03+1] scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [13:47:11] (03PS1) 10WMDE-Fisch: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) [13:47:27] (03PS2) 10WMDE-Fisch: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) [13:48:55] zeljkof: We know now, why it's not working -.- [13:49:00] we forgot this: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493409/ [13:50:38] (03CR) 10Marostegui: [V: 03+2 C: 03+2] update dashboard name so links to prometheus metric graphs work [software/tendril] - 
10https://gerrit.wikimedia.org/r/493376 (owner: 10ArielGlenn) [13:51:52] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [13:52:32] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus1003.eqiad.wmnet [13:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:01] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [13:55:52] (03Abandoned) 10Joal: Update sqoop timers templates to follow new CLI [puppet] - 10https://gerrit.wikimedia.org/r/493406 (https://phabricator.wikimedia.org/T215290) (owner: 10Joal) [13:56:08] (03Abandoned) 10Joal: Add ipblocks_restrictions table to labs sqoop list [puppet] - 10https://gerrit.wikimedia.org/r/493407 (https://phabricator.wikimedia.org/T209549) (owner: 10Joal) [13:56:47] !log re-start cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952 [13:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:50] T216952: Hadoop Yarn stores a ton of znodes related to running/old applications - https://phabricator.wikimedia.org/T216952 [14:00:05] hashar: That opportune time is upon us again. Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1400). 
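The missing piece addshore and CFisch_WMDE tracked down above is that BetaFeatures only offers a registered feature once it is listed in the wiki's whitelist (wgBetaFeaturesWhitelist); registering the preference alone is not enough, hence the follow-up patch 493409. A toy Python model of that gating — the feature key and the gating function here are illustrative, not the extension's actual code:

```python
# Hypothetical model of BetaFeatures gating: a feature registered by an
# extension only appears on Special:Preferences if the wiki's whitelist
# either is unset (everything allowed) or contains the feature's key.
def visible_beta_features(registered, whitelist):
    if whitelist is None:          # no whitelist configured: show everything
        return list(registered)
    return [f for f in registered if f in whitelist]

registered = ["reference-previews", "some-other-feature"]

# Before the fix: ReferencePreviews is registered but absent from the
# whitelist, so it never shows up as a beta feature.
before = visible_beta_features(registered, ["some-other-feature"])

# After the fix: the key is whitelisted and the beta feature appears.
after = visible_beta_features(
    registered, ["some-other-feature", "reference-previews"])
```

This also explains why no amount of cache-poking on mwdebug1002 made the feature appear: the config simply filtered it out.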
[14:01:40] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [14:01:46] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [14:04:41] (03PS1) 10Elukey: role::analytics_cluster::coordinator: deploy common analytics repos [puppet] - 10https://gerrit.wikimedia.org/r/493411 [14:04:45] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [14:05:41] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: deploy common analytics repos [puppet] - 10https://gerrit.wikimedia.org/r/493411 (owner: 10Elukey) [14:06:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Please merge by giving +2 at your convenience." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/493196 (owner: 10Alexandros Kosiaris) [14:08:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882 (owner: 10Muehlenhoff) [14:11:30] CFisch_remote: add it to the next swat? is it urgent? 
[14:11:42] (03PS1) 10Paladox: LocalUsernamesToLowerCase: Bind disabled GitReferenceUpdated instance [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/493412 [14:11:52] hashar is train conductor this week, maybe he can deploy it outside of swat [14:12:00] zeljkof: yeah would be cool if it goes out today [14:12:56] I mean there's also a break later in the calendar - so maybe its also doable there [14:13:34] hashar would know, it's train now [14:13:39] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:39] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:39] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:40] kk [14:13:43] thx [14:13:45] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:51] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:13:55] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:14:37] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [14:16:07] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [14:16:07] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [14:16:07] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:16:15] RECOVERY - DPKG on notebook1003 is OK: All packages OK [14:16:23] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [14:17:19] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
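The notebook1003 PROBLEM/RECOVERY burst above is the usual pattern when a host reboots while its NRPE checks are still scheduled. Icinga-style monitoring damps this with soft vs. hard states: a check only notifies after several consecutive failures, and a recovery only after a hard problem clears. A toy Python model of that idea — the retry count and event names are hypothetical, not Icinga's actual implementation:

```python
# Toy soft/hard state machine in the spirit of Icinga's max_check_attempts:
# emit PROBLEM only after `attempts` consecutive failures, and emit RECOVERY
# as soon as a hard problem sees a passing check again.
def notifications(results, attempts=3):
    events = []
    streak = 0          # consecutive failures so far (soft state)
    hard_down = False   # whether we are in a notified, hard PROBLEM state
    for ok in results:
        if ok:
            if hard_down:
                events.append("RECOVERY")
            streak, hard_down = 0, False
        else:
            streak += 1
            if streak >= attempts and not hard_down:
                events.append("PROBLEM")
                hard_down = True
    return events
```

With attempts=3, a two-poll blip during a reboot produces no notifications at all, while a sustained outage produces exactly one PROBLEM and one RECOVERY.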
[14:17:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php::monitoring: allow deployment hosts to access the admin port [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [14:18:01] (03PS2) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to access the admin port [puppet] - 10https://gerrit.wikimedia.org/r/493395 (https://phabricator.wikimedia.org/T211964) [14:19:51] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:24:32] !log milimetric@deploy1001 Started deploy [analytics/refinery@f605fad]: New sqoop logic that uses the sharded replicas [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:41] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:27:39] bla bla [14:27:46] going to do the group1 upgrade [14:28:22] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/WikibaseMediaInfo: Move up checks to test if we should construct depicts widgets - T217285 (duration: 00m 58s) [14:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:26] T217285: commonswiki / wikibase: Postcondition failed: Namespace for entity type property must be defined! 
- https://phabricator.wikimedia.org/T217285 [14:28:55] (03PS1) 10Hashar: group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 [14:28:57] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 (owner: 10Hashar) [14:30:39] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:30:41] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:30:47] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 (owner: 10Hashar) [14:30:50] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging] [14:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:52] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed [14:30:52] !log akosiaris@deploy1001 scap-helm citoid finished [14:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:27] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:31:27] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:32:01] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:33:29] hmmm kubernetes etcd is not having a good time in eqiad [14:34:32] !log milimetric@deploy1001 Finished deploy 
[analytics/refinery@f605fad]: New sqoop logic that uses the sharded replicas (duration: 10m 00s) [14:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] RECOVERY - Disk space on notebook1003 is OK: DISK OK [14:35:09] hm latencies are dropping again [14:36:05] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493417 (owner: 10Hashar) [14:36:06] this isn't action related though [14:36:26] as the staging cluster who uses a different etcd cluster did not suffer from that [14:36:32] s/did not/did [14:36:42] so both cluster suffered [14:37:29] no create/delete actions but compareAndSwap, get and list all had skyrocketing latencies [14:38:01] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:38:11] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:38:51] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:38:51] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:39:17] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:39:22] nothing though on etcd logs [14:39:42] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) The core php72 packages had the extensions rebuilt for 7.2 were all rebuilt (in the correct ordering) within the... 
[14:40:13] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19 [14:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:21] ah no, finally found it [14:40:22] (03CR) 10Thcipriani: [C: 03+1] "Nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [14:40:23] Feb 28 14:32:37 etcd1003 etcd[453]: the connection to peer 52897700ee9bcb73 is unhealthy [14:40:23] Feb 28 14:32:37 etcd1003 etcd[453]: the connection to peer 460d53f044bf905e is unhealthy [14:40:27] ok, networking issue it seems [14:40:41] [Exception MWException] (/srv/mediawiki/php-1.33.0-wmf.19/includes/cache/localisation/LocalisationCache.php:475) No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [14:40:43] grmblblblb [14:41:08] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.19 (duration: 00m 53s) [14:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:03] (03PS1) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to reach the opcode endpoint [puppet] - 10https://gerrit.wikimedia.org/r/493418 (https://phabricator.wikimedia.org/T211964) [14:42:58] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=rhodium.eqiad.wmnet [14:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:10] Feb 28 14:31:35 ganeti1008 kernel: [9245722.385430] block drbd2: Remote failed to finish a request within 43604ms > ko-count (7) * timeout (60 * 0.1s) [14:43:21] something networkingy looks like it [14:43:51] (03PS2) 10Elukey: Find db host and port using refinery [puppet] - 10https://gerrit.wikimedia.org/r/493331 (https://phabricator.wikimedia.org/T215290) (owner: 10Milimetric) [14:43:57] all drbd resources terminated and resumed [14:44:42] I am guessing something to do with ganeti1008 being re-racked (I am emptying 
it now) [14:45:46] so somehow wmf.19 lost the l10n cache for english language [14:46:02] well at least for some code paths / wikis / server [14:46:12] (03CR) 10Elukey: [C: 03+2] Find db host and port using refinery [puppet] - 10https://gerrit.wikimedia.org/r/493331 (https://phabricator.wikimedia.org/T215290) (owner: 10Milimetric) [14:46:24] oh [14:46:27] that is just on mw1272 [14:46:28] !log hashar@deploy1001 scap sync-l10n completed (1.33.0-wmf.19) (duration: 03m 33s) [14:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:44] !log mw1272 had /srv/mediawiki/php-1.33.0-wmf.19/includes/cache/localisation/LocalisationCache.php:475) No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [14:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:25] !log mw1272 fixed by running "scap sync-l10n" from deploy host [14:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14921/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/493418 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [14:50:01] !log reboot cloudnet2001-dev.codfw.wmnet [14:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:12] (03PS2) 10Giuseppe Lavagetto: php::monitoring: allow deployment hosts to reach the opcode endpoint [puppet] - 10https://gerrit.wikimedia.org/r/493418 (https://phabricator.wikimedia.org/T211964) [14:50:40] <_joe_> hashar: I think mw1272 was down/in maintenance for some time [14:52:37] PROBLEM - Host cloudnet2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [14:53:01] _joe_: yeah I am assuming that as well.
anyway syncing l10n did bring it up to date ;) [14:53:49] (03CR) 10Nikerabbit: [C: 04-1] Enable edittag for ExternalGuidance in CX and VE (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [14:54:59] !sal [14:54:59] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [14:55:47] RECOVERY - Host cloudnet2001-dev is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:59:37] (03PS1) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493421 [15:01:37] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:50] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (10Mathew.onipe) [15:02:24] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (10Mathew.onipe) a:03Mathew.onipe [15:02:48] (03PS2) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493421 (https://phabricator.wikimedia.org/T216993) [15:10:50] (03PS2) 10Herron: add interface::add_ip6_mapped to default production node definition [puppet] - 10https://gerrit.wikimedia.org/r/480537 (https://phabricator.wikimedia.org/T102099) [15:13:37] (03CR) 10Herron: [C: 03+2] add interface::add_ip6_mapped to default production node definition [puppet] - 10https://gerrit.wikimedia.org/r/480537 (https://phabricator.wikimedia.org/T102099) (owner: 10Herron) [15:14:00] (03PS1) 10Jbond: Offline rhodium.eqiad.wmnet so it can be rebooted [puppet] - 10https://gerrit.wikimedia.org/r/493422 (https://phabricator.wikimedia.org/T216802) [15:14:24] (03PS2) 10Jbond: Offline 
rhodium.eqiad.wmnet so it can be rebooted [puppet] - 10https://gerrit.wikimedia.org/r/493422 (https://phabricator.wikimedia.org/T216802) [15:14:41] <_joe_> !log uploading scap 3.9.1-1 to {stretch,jessie}-wikimedia [15:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:42] !log powering off db1114 to replace motherboard T214720 [15:15:44] (03CR) 10Jbond: [C: 03+2] Offline rhodium.eqiad.wmnet so it can be rebooted [puppet] - 10https://gerrit.wikimedia.org/r/493422 (https://phabricator.wikimedia.org/T216802) (owner: 10Jbond) [15:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:46] T214720: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 [15:16:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [15:17:10] <_joe_> thcipriani: ^^ I'm merging the patches that will bring scap 3.9.1 to beta [15:17:28] <_joe_> then we can test it I guess in production [15:17:49] <_joe_> (I love when I can say "we can test it in production") [15:17:59] (03PS3) 10Giuseppe Lavagetto: scap: do not manage package versions via puppet by default [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) [15:18:45] _joe_: sure, I'm still finishing morning-routine pre-meeting stuff, but beta already, actually, overrides scap::version (managed via cumin there) so that patch is fine. It'll affect other labs projects that have scap installed for whatever reason. 
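The scap packaging change discussed above (493404) leans on the usual Hiera pattern: Puppet stops pinning a package version by default, and an environment like beta can still pin one by overriding scap::version at a more specific hierarchy level — exactly why thcipriani notes beta "already overrides" it. A small Python sketch of that first-match lookup; the layer names and values are hypothetical, not the real hierarchy:

```python
# Hypothetical Hiera-style lookup: the first layer that defines the key wins,
# so a project-level override (as beta does for scap::version) beats the
# fleet-wide default.
HIERARCHY = [
    {"name": "project:deployment-prep", "data": {"scap::version": "3.9.1-1"}},
    {"name": "common",                  "data": {"scap::version": "present"}},
]

def lookup(key, hierarchy):
    for layer in hierarchy:
        if key in layer["data"]:
            return layer["data"][key]
    raise KeyError(key)

# Beta pins an explicit version; a host without the override just gets
# "present", i.e. "make sure the package is installed, let apt pick the
# version" — which is what "do not manage package versions by default" means.
```

The practical upside is the one _joe_ and thcipriani exercise next: new scap releases can be rolled out per-environment without touching the fleet-wide Puppet default.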
[15:19:07] (03PS2) 10Muehlenhoff: confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882
[15:19:13] !log rebooting rhodium
[15:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:16] _joe_: I'll be more around in 15 if I'm needed for testing
[15:19:24] <_joe_> thcipriani: take your time
[15:19:36] <_joe_> we can test things only quite some time later
[15:21:15] 10Operations, 10ExternalGuidance, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Thanks, @santhosh. When you say "context detection code", I take that to mean inc...
[15:21:35] PROBLEM - Host rhodium is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:22] (03PS3) 10Muehlenhoff: confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882
[15:23:01] RECOVERY - Host rhodium is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[15:23:20] (03CR) 10Muehlenhoff: [C: 03+2] confd: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/490882 (owner: 10Muehlenhoff)
[15:23:27] PROBLEM - Host db1114.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:33] (03CR) 10CDanis: [C: 03+1] Initial tox setup [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493297 (owner: 10Volans)
[15:23:35] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1005.eqiad.wmnet
[15:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:07] (03PS1) 10Urbanecm: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311)
[15:26:17] (03PS1) 10Jbond: Add rhodium.eqiad.wmnet back into service [puppet] - 10https://gerrit.wikimedia.org/r/493426 (https://phabricator.wikimedia.org/T216802)
[15:26:53] (03PS9) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196)
[15:27:01] (03CR) 10Jbond: [C: 03+2] Add rhodium.eqiad.wmnet back into service [puppet] - 10https://gerrit.wikimedia.org/r/493426 (https://phabricator.wikimedia.org/T216802) (owner: 10Jbond)
[15:27:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse)
[15:27:56] (03PS1) 10Urbanecm: Add throttle rule for Art+Feminism 2019 editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336)
[15:29:37] !log rebooting labstore2001
[15:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:55] PROBLEM - Host ms-fe1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:57] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:33:23] !log rebooting labstore2002
[15:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:49] (03PS8) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123)
[15:35:15] (03PS1) 10Herron: kafka-logging: replace logstash1005 with logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493429 (https://phabricator.wikimedia.org/T213898)
[15:36:44] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1005.eqiad.wmnet
[15:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:39] !log rebooting labsdb1006
[15:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:45] (03PS1) 10Herron: logstash: disable notifications on logstash1005 and logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898)
[15:38:15] RECOVERY - Host ms-fe1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.12 ms
[15:38:17] PROBLEM - Host ganeti1008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:38:46] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[15:38:59] hmm, is ganeti1008 expected?
[15:39:06] PROBLEM - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[15:39:09] its not me
[15:39:50] PROBLEM - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[15:40:15] yeah ganeti1008 was being moved, cc akosiaris
[15:40:27] _joe_: I'm around now if you are
[15:40:39] cool, was just looking at the syslogs. ok will leave alone
[15:41:18] <_joe_> thcipriani: uhm, yes, so should I just install scap on deploy1001/1002 for now?
[15:41:54] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 GTirloni T215892
[15:42:04] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute GTirloni T215892
[15:42:10] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute GTirloni T215892
[15:42:30] _joe_: I think that'll be good enough to test new functionality
[15:42:46] Thanks cmjohnson1! :)
[15:43:07] (03PS1) 10Giuseppe Lavagetto: Set wgWMEPhp7SamplingRate to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493431
[15:43:18] <_joe_> thcipriani: assuming my puppet changes have arrived everywhere, yes
[15:43:31] !log rebooting labsdb1007
[15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:44] <_joe_> thcipriani: uh now that I think of it, I might need to merge one further change
[15:43:57] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1006.eqiad.wmnet
[15:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:23] <_joe_> thcipriani: yeah sorry
[15:45:48] <_joe_> thcipriani: actually, we can test things right away
[15:46:02] _joe_: are all the new dsh files setup for mw_web_clusters?
[15:46:11] <_joe_> thcipriani: exactly that
[15:46:14] :)
[15:46:17] <_joe_> we'll override with -D
[15:46:25] k
[15:46:36] (03PS2) 10Herron: logstash: disable notifications on logstash1005 and logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898)
[15:46:47] <_joe_> !log install scap 3.9.1-1 on the deployment servers
[15:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:22] (03PS1) 10Muehlenhoff: package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/493433
[15:47:24] (03CR) 10Herron: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14926/" [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron)
[15:48:43] PROBLEM - Host ores1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:53] <_joe_> thcipriani: installed, please let's first test a normal null run
[15:48:58] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging]
[15:48:59] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed
[15:48:59] !log akosiaris@deploy1001 scap-helm citoid finished
[15:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:16] _joe_: doing
[15:49:17] <_joe_> jijiki: can you take a look at ores1002?
[15:49:30] it's being moved
[15:49:33] let it be
[15:49:38] <_joe_> oh ok
[15:49:46] <_joe_> #-dcops I guess
[15:50:21] PROBLEM - puppet last run on backup2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[15:50:47] !log thcipriani@deploy1001 Synchronized README: noop sync scap 3.9.1-1 (duration: 00m 52s)
[15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:49] PROBLEM - puppet last run on sessionstore2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[15:52:01] PROBLEM - Host ores1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:06] _joe_: no-php opcache sync-file looks good spot checking a few things
[15:52:23] <_joe_> cool
[15:52:40] (03Abandoned) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493421 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe)
[15:52:42] (03PS1) 10DCausse: Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435
[15:52:44] (03PS1) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[15:52:47] <_joe_> so, to test the whole thing
[15:52:54] !log rebooting labsdb1004
[15:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:01] (03PS1) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493437 (https://phabricator.wikimedia.org/T216993)
[15:53:50] <_joe_> thcipriani: scap .... -Dphp7_admin_port:9181 -Dmw_web_clusters:mediawiki-appserver-canaries,mediawiki-api-canaries ?
[15:53:53] (03PS2) 10DCausse: Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435
[15:53:56] (03PS2) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[15:54:19] _joe_: I'll go with those
[15:54:45] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:54:56] 10Operations, 10Analytics, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey) Proposal for removal: `registry brokers services etc consumers` @Ottomata what do you think?
[15:55:23] RECOVERY - puppet last run on backup2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[15:55:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey)
[15:55:49] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[15:55:54] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse)
[15:56:41] !log thcipriani@deploy1001 Synchronized README: noop sync to test opcache-manager in scap 3.9.1-1 (duration: 00m 53s)
[15:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:53] RECOVERY - puppet last run on sessionstore2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:58:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10Ottomata) don't know about registry, services or etc, but /brokers and /consumers should be leftover from when we might have had a...
[15:58:44] 10Operations, 10Discovery-Search: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (10Mathew.onipe)
[16:01:25] RECOVERY - Host db1114.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.62 ms
[16:01:26] jouncebot: now
[16:01:26] No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[16:01:43] CFisch_remote: you meant now right? :)
[16:02:37] yay
[16:02:54] is there a patch? :)
[16:03:11] PROBLEM - Host db1104.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:15] (03PS12) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[16:03:42] addshore: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493409/
[16:03:46] (03PS9) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123)
[16:05:25] (03PS3) 10Addshore: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:05:50] ^^ if noone has any objections I'm going to push that one out in the next few mins (a followup from swat earlier)
[16:06:26] (03PS1) 10Herron: rsyslog: replace logstash1004 with logstash1010 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493440 (https://phabricator.wikimedia.org/T213898)
[16:06:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey) /etc is my fault when I've set up burrow the first time, and registry/services seems to be @joal's slider test (so safe to...
[16:08:18] !log rebooting labstore2003
[16:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:33] (03CR) 10Addshore: [C: 03+2] Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:08:46] (03PS2) 10Herron: rsyslog: replace logstash1004 with logstash1010 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493440 (https://phabricator.wikimedia.org/T213898)
[16:09:59] (03Merged) 10jenkins-bot: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:10:04] (03CR) 10Herron: [C: 03+2] rsyslog: replace logstash1004 with logstash1010 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493440 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron)
[16:11:10] CFisch_remote: it is on mwdebug1002
[16:11:56] checking
[16:13:33] works like a charm \o/
[16:13:37] RECOVERY - Host db1104.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms
[16:13:39] can be deployed addshore
[16:13:41] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:13:42] CFisch_remote: amazing
[16:13:49] (03PS13) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[16:14:46] syncing
[16:15:35] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T214905 Add ReferencePreviews to allowed BetaFeatures (duration: 00m 54s)
[16:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:39] T214905: Show referencePreviews on group0 wikis - https://phabricator.wikimedia.org/T214905
[16:15:41] CFisch_remote: all done!
[16:18:20] (03PS14) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[16:19:49] (03CR) 10jenkins-bot: Add ReferencePreviews to allowed BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch)
[16:20:45] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) >>! In T187987#4990906, @fgiunchedi wrote: > Thanks @Gilles ! > > I have tried again a 9000h migration and that worked, I've now merged...
[16:22:55] (03PS2) 10Herron: kafka-logging: replace logstash1005 with logstash1011 [puppet] - 10https://gerrit.wikimedia.org/r/493429 (https://phabricator.wikimedia.org/T213898)
[16:23:11] (03CR) 10Mathew.onipe: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1002/14930/cloudelastic1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:23:14] (03CR) 10CRusnov: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov)
[16:27:56] !log migrating kafka on logstash1005 to logstash1011 T213898
[16:27:56] !log migrating kafka on logstash1005 to logstash1011 T213898
[16:27:57] RECOVERY - Host ores1002 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[16:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:00] T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898
[16:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:50] double the logs double the fun… irc client disconnected/reconnected grr
[16:29:05] RECOVERY - Host ganeti1008 is UP: PING OK - Packet loss = 0%, RTA = 36.91 ms
[16:29:34] (03CR) 10EBernhardson: cloudelastic: Add cloudelastic configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:29:39] RECOVERY - Host ores1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.35 ms
[16:29:50] (03PS4) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229)
[16:30:05] (03CR) 10Herron: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14928/" [puppet] - 10https://gerrit.wikimedia.org/r/493429 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron)
[16:30:34] (03PS1) 10Ottomata: eventgate-analytics Set kafka compression.codec: snappy and message.max.bytes: 4194304 [deployment-charts] - 10https://gerrit.wikimedia.org/r/493443 (https://phabricator.wikimedia.org/T206785)
[16:32:17] PROBLEM - Check whether ferm is active by checking the default input chain on ganeti1008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[16:32:39] PROBLEM - Check systemd state on ganeti1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:34:24] (03CR) 10Mathew.onipe: cloudelastic: Add cloudelastic configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:36:33] PROBLEM - puppet last run on ganeti1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:37:19] (03PS1) 10Ottomata: eventgate: set compression.codec: snappy and message.max.bytes: 4194304 [deployment-charts] - 10https://gerrit.wikimedia.org/r/493444 (https://phabricator.wikimedia.org/T206785)
[16:39:00] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey) Got down to: ` [zk: localhost:2181(CONNECTED) 37] ls / [zookeeper, yarn-leader-election, hadoop-ha, hive_zookeeper_namesp...
[16:39:06] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (10elukey)
[16:39:57] !log clean up old/stale zookeeper znodes from conf100[4-6] - T216979
[16:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:01] T216979: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979
[16:43:17] RECOVERY - Check whether ferm is active by checking the default input chain on ganeti1008 is OK: OK ferm input default policy is set
[16:43:21] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1006.eqiad.wmnet
[16:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:46] RECOVERY - puppet last run on ganeti1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:48:55] (03CR) 10DCausse: [C: 04-1] Upgrade logstash plugins to 5.6.14 (031 comment) [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493437 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe)
[16:50:13] (03CR) 10CDanis: [C: 03+1] "A couple thoughts but don't feel strongly; this looks pretty good." (032 comments) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (owner: 10Volans)
[16:54:42] PROBLEM - Host ms-be1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:55:51] (03PS1) 10BryanDavis: toolforge: switch from thirdparty/php72 to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712)
[16:56:13] the ms-be1030 alert must be T212348 ?
[16:56:13] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348
[16:56:47] (03CR) 10BryanDavis: "Needs testing in tools-beta or similar before deploying widely" [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis)
[17:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1700).
[17:00:04] No GERRIT patches in the queue for this window AFAICS.
[17:00:28] \o/
[17:00:46] cdanis: indeed, downtimed the host but not the mgmt
[17:05:20] RECOVERY - Host ms-be1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[17:06:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] eventgate: set compression.codec: snappy and message.max.bytes: 4194304 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/493444 (https://phabricator.wikimedia.org/T206785) (owner: 10Ottomata)
[17:06:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov)
[17:08:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/493433 (owner: 10Muehlenhoff)
[17:08:13] (03PS2) 10Alexandros Kosiaris: package_builder: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/493433 (owner: 10Muehlenhoff)
[17:09:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/493433 (owner: 10Muehlenhoff)
[17:09:38] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Move cloudvirt1012 to a 10G rack and connect 10g nics - https://phabricator.wikimedia.org/T217346 (10Andrew)
[17:10:31] (03CR) 10Nikerabbit: [C: 04-1] Enable edittag for ExternalGuidance in CX and VE (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry)
[17:10:49] (03CR) 10Volans: [C: 03+2] Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435 (owner: 10DCausse)
[17:11:20] _joe_: https://phabricator.wikimedia.org/T217335 FYI, something missbehaving on beta when the php7 BF is on
[17:11:29] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Move cloudvirt1018 to a 10G rack, connect 10G nics - https://phabricator.wikimedia.org/T217347 (10Andrew)
[17:11:41] (03PS5) 10CRusnov: Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229)
[17:11:57] <_joe_> addshore: good to know, I guess more details are neeeded
[17:12:09] <_joe_> and probably you want to involve people in core platform as well
[17:12:29] <_joe_> if you're sure it has to do with php7, please add "php7.2-support" as a tag
[17:12:29] (03Merged) 10jenkins-bot: Add .gitreview [cookbooks] - 10https://gerrit.wikimedia.org/r/493435 (owner: 10DCausse)
[17:12:34] (03CR) 10CRusnov: [C: 03+2] Add /etc/default/ganeti to allow rapi to listen to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/493349 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov)
[17:13:10] yup, check checked on prod (group0) and the issue doesnt seem to show its face there, which is good. I'll go and tag the task now
[17:13:31] (03PS2) 10Jbond: Load apt when testing base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/490837
[17:14:15] <_joe_> addshore: you can verify with XWD
[17:14:29] (03CR) 10Jbond: [C: 03+2] Load apt when testing base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/490837 (owner: 10Jbond)
[17:14:47] (03CR) 10EBernhardson: [C: 03+1] cloudelastic: Add cloudelastic configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[17:14:50] _joe_: that ticket i linked you was totally the wrong one
[17:15:16] <_joe_> uh?
[17:15:34] <_joe_> which one is the correct one then?
[17:15:34] (03PS3) 10Jbond: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff)
[17:16:04] PROBLEM - Host ms-be1028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:16:17] (03CR) 10jerkins-bot: [V: 04-1] Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff)
[17:16:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 (10Andrew) the next step is moving this to a new rack for 10G connections (either 2, 4 or 7 in row B) so I'm tagging dc-ops. You can ha...
[17:16:54] _joe_: https://phabricator.wikimedia.org/T217323
[17:17:04] I even started adding comments to the wrong ticket, think my work day is over
[17:17:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1012 to a 10G rack and connect 10g nics - https://phabricator.wikimedia.org/T217346 (10Andrew) Steps: [] Move host to a rack with 10G -- B2, B4 or B7 I believe [] Enable the 10G nic in the bios [] Move/install cables [] Upd...
[17:17:33] <_joe_> addshore: ahah ok, but tbh it's easier to solve
[17:17:40] yup
[17:19:12] (03CR) 10Volans: [V: 03+2 C: 03+2] Initial tox setup [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493297 (owner: 10Volans)
[17:19:20] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10Andrew) Steps: [] Move host to a rack with 10G -- B2, B4 or B7 I believe [] Enable the 10G nic in the bios [] Move/install cables [] Update switch config [] Re-image
[17:19:26] PROBLEM - puppet last run on ganeti1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:19:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1018 to a 10G rack, connect 10G nics - https://phabricator.wikimedia.org/T217347 (10Andrew) Steps: [] Move host to a rack with 10G -- B2, B4 or B7 I believe [] Enable the 10G nic in the bios [] Move/install cables [] Update...
[17:19:48] (03PS4) 10Jbond: Create /etc/debdeploy-autorestarts.conf which lists all automated restarts [puppet] - 10https://gerrit.wikimedia.org/r/493401 (owner: 10Muehlenhoff)
[17:19:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1024 to 10Gb ethernet - https://phabricator.wikimedia.org/T216724 (10Andrew)
[17:20:50] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:20:54] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:21:06] yep that's me
[17:21:08] i broken it
[17:21:22] RECOVERY - Host ms-be1028.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 37.26 ms
[17:22:32] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:22:40] (03PS10) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123)
[17:23:38] !log recreating replicas, master ops events for db1078, db1075 T213858
[17:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:41] T213858: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858
[17:24:07] !log powering down sodium to move racks T212348
[17:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:10] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348
[17:25:32] (03PS1) 10CRusnov: ganeti: Fix path to etc/default file. [puppet] - 10https://gerrit.wikimedia.org/r/493456
[17:26:55] <_joe_> addshore: can you confirm if you still see the problem or not?
[17:27:00] PROBLEM - Host sodium is DOWN: PING CRITICAL - Packet loss = 100%
[17:27:16] (03CR) 10CRusnov: [C: 03+2] ganeti: Fix path to etc/default file. [puppet] - 10https://gerrit.wikimedia.org/r/493456 (owner: 10CRusnov)
[17:27:41] _joe_: solved for me
[17:27:42] ty
[17:28:24] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ganeti]
[17:28:27] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gtirloni on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1015.eqiad.wmnet ` The log can be found in `/var/log/wmf-a...
[17:28:48] <_joe_> I just restarted php7.2-fpm
[17:29:00] PROBLEM - Host sodium.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:29:34] RECOVERY - puppet last run on ganeti1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:31:35] (03PS3) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[17:32:41] ah lovely I was wondering why my pxe install was not working
[17:32:45] sodium is down?
[17:32:46] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:33:03] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse)
[17:33:10] ah it is moving
[17:33:12] bad timing
[17:33:20] * elukey cries in a corner
[17:33:32] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:34:30] elukey: Go torrent a linux iso
[17:34:40] * Reedy looks around
[17:35:09] * elukey hands labsdb1012 install over to Reedy
[17:36:14] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:36:16] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:36:20] RECOVERY - Host sodium is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms
[17:36:44] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[17:36:52] PROBLEM - Host ms-be1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:38:36] (03PS4) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[17:39:37] (03PS1) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493460 (https://phabricator.wikimedia.org/T216993)
[17:39:42] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libguestfs-tools]
[17:39:48] RECOVERY - Host sodium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms
[17:39:58] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse)
[17:40:56] (03Abandoned) 10Mathew.onipe: Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493437 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe)
[17:42:12] RECOVERY - Host ms-be1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[17:42:32] (03PS5) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436
[17:44:23] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T217301 (10Papaul) a:05Papaul→03Marostegui Disk replacement complete.
[17:45:01] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) 05Open→03Resolved the motherboard has been replaced, the idrac and bios have been updated to latest version. resolving task, reopen if there are any problems.
[17:47:35] (03PS1) 10Jbond: Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 [17:47:43] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) FYI I solved the issue disabling the SD card support in: System Configuration -> Bios/Platform configuration -> System Options -> Usb... [17:49:17] (03CR) 10DCausse: [C: 03+1] Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493460 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe) [17:49:18] RECOVERY - nova-compute proc maximum on cloudvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [17:51:46] !log logstash1011 kafka now in sync. transitioning logstash1005 to spare system T213898 [17:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:49] T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 [17:52:34] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10herron) [17:52:46] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10herron) [17:53:19] (03CR) 10DCausse: [C: 04-1] Add cookbook for elastic6 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1800). [18:00:46] ACKNOWLEDGEMENT - MegaRAID on sodium is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T217356 [18:00:51] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T217356 (10ops-monitoring-bot) [18:01:58] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:04:10] (03CR) 10Herron: [C: 03+2] "ready to re-enable alerts now, reverting" [puppet] - 10https://gerrit.wikimedia.org/r/493430 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:04:20] (03PS1) 10Herron: Revert "logstash: disable notifications on logstash1005 and logstash1011" [puppet] - 10https://gerrit.wikimedia.org/r/493470 [18:04:28] (03PS2) 10Herron: Revert "logstash: disable notifications on logstash1005 and logstash1011" [puppet] - 10https://gerrit.wikimedia.org/r/493470 [18:05:30] (03CR) 10Herron: [C: 03+2] Revert "logstash: disable notifications on logstash1005 and logstash1011" [puppet] - 10https://gerrit.wikimedia.org/r/493470 (owner: 10Herron) [18:07:19] 10Operations, 10Analytics, 10Services: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. 
- https://phabricator.wikimedia.org/T217359 (10Ottomata) [18:07:45] (03PS6) 10DCausse: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 [18:10:34] (03PS1) 10Herron: kafka-logging: replace logstash1006 with logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) [18:12:11] PROBLEM - Kafka Broker Replica Max Lag on logstash1011 is CRITICAL: 4.978e+06 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [18:13:14] herron: ---^ [18:13:21] probably due to the migration [18:13:59] hmm yeah, no data points? [18:14:16] oh nvm [18:14:46] it’s in sync now, this should clear [18:17:09] PROBLEM - Kafka Broker Replica Max Lag on logstash1011 is CRITICAL: 8.235e+05 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [18:19:04] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1015.eqiad.wmnet'] ` and were **ALL** successful. [18:19:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10Nuria) p:05High→03Triage [18:19:18] these are false positives from while this host was syncing up. not sure why icinga is throwing critical. 
looking [18:21:12] "Alert if large replica lag for more than 50% of the time in the last 30 minutes." [18:21:26] !log cp1076 power down for network port move [18:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:41] death to false positives [18:21:58] that’s a lot easier to understand than 8.235e+05 ge 5e+05 [18:22:29] RECOVERY - Kafka Broker Replica Max Lag on logstash1011 is OK: (C)5e+05 ge (W)1e+05 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [18:23:29] PROBLEM - Host cp1076 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:55] (03PS1) 10Herron: logstash: disable notifications on logstash1006 and logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493476 (https://phabricator.wikimedia.org/T213898) [18:27:52] (03CR) 10Herron: [C: 03+2] logstash: disable notifications on logstash1006 and logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493476 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:28:05] RECOVERY - Host cp1076 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [18:28:20] !log cp1077 power off for network port relocation [18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:56] !log stop pybal on lvs1016 - T212348 [18:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:00] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 [18:29:15] (03PS10) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) [18:30:27] PROBLEM - Host cp1077 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner:
10DCausse) [18:31:31] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=46) [18:31:35] (03CR) 10Ottomata: "Neither of these will matter for devs. snappy is fine and should be used anyway. The message size limit should be set the same for devel" [deployment-charts] - 10https://gerrit.wikimedia.org/r/493444 (https://phabricator.wikimedia.org/T206785) (owner: 10Ottomata) [18:31:39] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [18:31:51] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:33:29] (03CR) 10CRusnov: "according to the documentation, this should allow core dumps to form (for example if the value of core_limit is to to 'unlimited') from se" [puppet] - 10https://gerrit.wikimedia.org/r/493294 (owner: 10CRusnov) [18:33:43] (03PS1) 10Herron: rsyslog: replace logstash1005 with logstash1011 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493477 (https://phabricator.wikimedia.org/T213898) [18:34:24] !log cp1078 power down for network move [18:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:35] RECOVERY - Host cp1077 is UP: PING OK - Packet loss = 0%, RTA = 37.62 ms [18:36:01] PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:19] (03Abandoned) 10Ottomata: eventgate-analytics Set kafka compression.codec: snappy and message.max.bytes: 4194304 [deployment-charts] - 10https://gerrit.wikimedia.org/r/493443 (https://phabricator.wikimedia.org/T206785) (owner: 10Ottomata) [18:39:18] (03CR) 10Herron: [C: 03+2] rsyslog: replace logstash1005 with logstash1011 in kafka_shipper [puppet] - 10https://gerrit.wikimedia.org/r/493477 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:41:39] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 
not-conn: cp1078_v4, cp1078_v6 [18:41:41] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:41:45] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:41:49] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:51] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:51] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:55] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:57] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:41:57] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:05] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:15] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:17] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:17] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:17] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:21] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:23] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: 
cp1078_v4, cp1078_v6 [18:42:23] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:27] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:27] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:29] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:31] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:31] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:31] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:33] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:33] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:33] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:39] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:39] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:39] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp1078_v4, cp1078_v6 [18:42:45] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:46] loud [18:42:47] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:47] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:42:49] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp1078_v4, cp1078_v6 [18:43:27] RECOVERY - Host cp1078 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms 
[18:43:27] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [18:43:27] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [18:43:27] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [18:43:29] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK [18:43:31] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK [18:43:31] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK [18:43:31] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK [18:43:33] !log start pybal on lvs1016 - T212348 [18:43:35] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [18:43:37] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK [18:43:37] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK [18:43:41] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [18:43:41] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [18:43:43] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK [18:43:43] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [18:43:45] (03PS9) 10Herron: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 [18:43:47] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK [18:43:47] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK [18:43:53] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK [18:43:53] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [18:43:53] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [18:43:55] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [18:44:01] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK [18:44:01] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK 
[18:44:01] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK [18:44:03] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK [18:44:04] (03CR) 10jerkins-bot: [V: 04-1] rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 (owner: 10Herron) [18:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:07] T212348: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 [18:44:07] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [18:44:26] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK [18:44:30] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK [18:46:03] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/14931/" [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:46:05] 10Operations, 10ops-codfw: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) [18:46:28] 10Operations, 10ops-codfw: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) p:05Triage→03Normal [18:46:56] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 46 connections established with conf1004.eqiad.wmnet:4001 (min=46) [18:47:54] (03CR) 10Bstorm: labstore: convert our first systemd timer to the new format (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:47:56] (03PS1) 10Esanders: VE: Enable true section editing for mobile on labswiki & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) [18:49:37] (03PS2) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 
(https://phabricator.wikimedia.org/T210818) [18:50:13] (03CR) 10jerkins-bot: [V: 04-1] labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:50:39] (03PS1) 10Giuseppe Lavagetto: scap::dsh: add specific groups for mediawiki web clusters [puppet] - 10https://gerrit.wikimedia.org/r/493484 [18:50:41] (03PS1) 10Giuseppe Lavagetto: scap: fix php version, add php7 admin port [puppet] - 10https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964) [18:50:43] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: stop revalidating opcache [puppet] - 10https://gerrit.wikimedia.org/r/493486 (https://phabricator.wikimedia.org/T211964) [18:52:06] !log migrating logstash1006 kafka to logstash1012 T213898 [18:52:09] (03PS2) 10Giuseppe Lavagetto: scap::dsh: add specific groups for mediawiki web clusters [puppet] - 10https://gerrit.wikimedia.org/r/493484 [18:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:10] T213898: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 [18:52:22] !log installing libgd security updates on trusty [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:56] RECOVERY - Device not healthy -SMART- on db2033 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2033&var-datasource=codfw+prometheus/ops [18:53:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap::dsh: add specific groups for mediawiki web clusters [puppet] - 10https://gerrit.wikimedia.org/r/493484 (owner: 10Giuseppe Lavagetto) [18:53:53] (03CR) 10Herron: [C: 03+2] "As was done with logstash100[45], I've manually stopped kafka on logstash1006 and will leave puppet disabled there until after logstash101" [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:54:10] (03PS2) 10Herron: kafka-logging: replace logstash1006 with logstash1012 [puppet] - 10https://gerrit.wikimedia.org/r/493471 (https://phabricator.wikimedia.org/T213898) [18:54:29] (03PS3) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [18:54:38] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK [18:55:07] (03CR) 10jerkins-bot: [V: 04-1] labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:59:38] (03PS2) 10Smalyshev: Enable WikibaseCirrusSearch on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) [18:59:50] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T1900). [19:00:04] kostajh, Smalyshev, Zoranzoki21, Urbanecm, ottomata, and Pchelolo: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:10] here [19:00:11] here [19:00:14] here [19:01:08] (03PS4) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [19:01:32] here [19:01:38] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK [19:03:07] (03CR) 10Thcipriani: [C: 03+1] "nice! test went well today, lgtm." [puppet] - 10https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [19:03:16] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [19:03:44] (03PS1) 10Muehlenhoff: Remove obsolete Upstart job [puppet] - 10https://gerrit.wikimedia.org/r/493489 [19:05:02] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [19:05:21] I can SWAT [19:06:48] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK [19:06:48] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK [19:06:48] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK [19:07:02] (03PS2) 10Thcipriani: GrowthExperiments: Start help panel experiment on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:07:17] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:07:20] Thanks thcipriani. 
[19:07:52] sure thing :) [19:08:26] (03Merged) 10jenkins-bot: GrowthExperiments: Start help panel experiment on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:08:40] (03CR) 10jenkins-bot: GrowthExperiments: Start help panel experiment on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493287 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:09:04] (03PS5) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [19:09:17] kostajh: your change is live on mwdebug1002 if there's anything to check there. [19:09:38] thcipriani: just a moment [19:12:00] (03CR) 10Muehlenhoff: "With regard to updating the docs for the rollout, there's existing documentation for debdeploy which can be repurposed (or probably bets t" [puppet] - 10https://gerrit.wikimedia.org/r/493404 (https://phabricator.wikimedia.org/T217287) (owner: 10Giuseppe Lavagetto) [19:12:09] thcipriani: you can deploy it [19:12:12] thanks! [19:12:18] going live [19:14:16] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:07] (03CR) 10Muehlenhoff: [C: 03+1] "The patch looks fine, but I haven't tested this further yet, I can ping this review set once the tests for the app servers are completed." [puppet] - 10https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: 10BryanDavis) [19:15:26] hrm, one of these hosts seems stuck, maybe mw1272? 
[19:15:46] yes, 1272 [19:16:09] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:493287|GrowthExperiments: Start help panel experiment on viwiki]] T215666 (duration: 03m 02s) [19:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:13] T215666: Help panel: deploy on Vietnamese Wikipedia - https://phabricator.wikimedia.org/T215666 [19:16:24] ^ kostajh live now [19:16:34] thcipriani: cheers! [19:16:37] PROBLEM - Host mw1272.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:16:55] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.080 second response time [19:17:04] (03PS2) 10Esanders: VE: Enable true section editing for mobile on labswiki & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) [19:17:08] anyone around who could look into mw1272? [19:17:55] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [19:18:16] robh: could you take a look at mw1272 or wrangle the appropriate person (I see you're on clinic duty)? [19:18:17] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:18:42] sure [19:18:50] thanks :) [19:19:15] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [19:19:17] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [19:19:26] (03CR) 10Aaron Schulz: "If the "worst" reply of all routes used in the SET broadcast is NOT_STORED (e.g. 
due to NullRoute) and that is given back to either of the" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [19:19:48] SMalyshev: after your change merges it'll go out with the next run of https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [19:20:02] thcipriani: ok cool [19:20:08] (03PS10) 10Herron: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 [19:20:15] should only change anything for beta [19:20:37] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:20:55] (03Merged) 10jenkins-bot: Enable WikibaseCirrusSearch on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [19:21:11] (03CR) 10jenkins-bot: Enable WikibaseCirrusSearch on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493321 (https://phabricator.wikimedia.org/T215684) (owner: 10Smalyshev) [19:21:35] !log mw1272 unresponsive to mgmt or production interfaces [19:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:39] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.078 second response time [19:22:11] !log mw1272 being worked on by onsite [19:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:21] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336) (owner: 10Urbanecm) [19:23:37] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms [19:23:44] (03CR) 10Thcipriani: "invalid date in here somewhere. I'll look at the end of SWAT if I have time." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:24:39] (03Merged) 10jenkins-bot: Add throttle rule for Art+Feminism 2019 editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336) (owner: 10Urbanecm) [19:25:21] PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:25:37] 10Operations, 10ops-eqiad, 10serviceops, 10HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (10Cmjohnson) Received the parts, replaced CPU2 and DIMM B1 and cleared the log Return shipping info USPS 9202 3946 2441 1124 14 FEDEX 9611918 2393026 77862432 [19:25:37] RECOVERY - Host mw1272.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [19:26:20] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:493321|Enable WikibaseCirrusSearch on Beta Cluster]] (beta only change/noop sync) T215684 (duration: 00m 55s) [19:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:24] T215684: Deploy & test WikibaseCirrusSearch on beta cluster - https://phabricator.wikimedia.org/T215684 [19:26:25] RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.080 second response time [19:27:39] Urbanecm: all merged, syncing your change now. [19:27:45] ok [19:28:21] Urbanecm: if you have a few minutes could you take a look at https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493375/ while I sync out the last couple changes? If you don't have time that's OK, too. 
[19:28:48] thcipriani, sure [19:28:54] thank you [19:29:16] yw [19:30:09] PROBLEM - HHVM jobrunner on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [19:30:29] (03PS2) 10Thcipriani: Remove legacy eventBus config settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:30:38] !log thcipriani@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493427|Add throttle rule for Art+Feminism 2019 editathon]] T217336 (duration: 00m 54s) [19:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:53] T217336: Lift IP cap on en.wiki for account creation for Jewish Museum NYC - Sunday March 3 - https://phabricator.wikimedia.org/T217336 [19:31:14] (03CR) 10jenkins-bot: Add throttle rule for Art+Feminism 2019 editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493427 (https://phabricator.wikimedia.org/T217336) (owner: 10Urbanecm) [19:31:17] RECOVERY - HHVM jobrunner on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.079 second response time [19:31:23] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:31:45] PROBLEM - HHVM jobrunner on mw2280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.027 second response time [19:32:16] danke [19:32:29] (03Merged) 10jenkins-bot: Remove legacy eventBus config settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:32:55] RECOVERY - HHVM jobrunner on mw2280 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.010 second response time [19:33:41] ottomata: oh, does wmf.19 need to be everywhere for this change? 
[19:34:07] we're not quite there just yet https://tools.wmflabs.org/versions/ [19:34:14] not for that one [19:34:19] ah, cool [19:34:23] there were 2 [19:34:29] i removed the one that was dependent on .19 [19:34:49] the code that went out last week should be using the new configs already [19:34:55] ottomata: ok, change is on mwdebug1002 if there's anything you want to check before it goes live [19:35:01] ok i'll double check just in case... [19:35:59] (03PS3) 10Urbanecm: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:36:11] (03CR) 10jerkins-bot: [V: 04-1] Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:36:13] (03CR) 10Bstorm: "This seems to do only what it is supposed to! Madness https://puppet-compiler.wmflabs.org/compiler1002/14933/labstore2004.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [19:36:43] (03PS4) 10Urbanecm: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:36:47] <_joe_> !log upgrading scap on all servers [19:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:23] thcipriani, should be fixed in PS4 [19:37:54] thcipriani: all looks good [19:37:56] proceed! [19:37:58] Urbanecm: thank you! 
I appreciate it :) [19:38:04] ottomata: /me does [19:38:05] Yw [19:39:14] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Cmjohnson) The new CPU came in and I replaced CPU1 Return Shipping USPS 9202 3946 5301 2441 1128 27 FEDEX 9611918 2393026 77862845 [19:40:23] !log thcipriani@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT [[gerrit:492770|Remove legacy eventBus config settings.]] (duration: 00m 53s) [19:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:39] ^ ottomata live now [19:40:49] danke [19:41:03] PROBLEM - mysqld processes on labsdb1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:41:31] page [19:41:37] <_joe_> yep [19:41:38] downtime expired I think [19:41:40] bstorm_: ^ [19:41:43] <_joe_> ah ok [19:41:54] Yeah...but I thought I brought it back up :) [19:41:55] Checking [19:42:43] (03CR) 10jenkins-bot: Remove legacy eventBus config settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [19:43:01] bstorm_: someone stopped it again apparently [19:43:20] 190228 15:53:09 [Note] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown [19:43:23] It's no longer the replica of toolsdb [19:43:27] But still.... [19:43:32] MySQL is down [19:43:33] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:43:41] JBond42: ? [19:44:09] Sorry [19:44:13] jbond42 [19:44:20] bstorm_: I am going to downtime it again for now to avoid another page [19:44:26] I know he was looking to do reboots [19:44:38] Ok, thanks. I'd expect he'd downtime it though...
[19:44:39] (03Merged) 10jenkins-bot: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:44:55] bstorm_: yeah, that matches the entry on the last, probably jbond42 [19:45:00] I guess he didn't bring mysql up? [19:45:15] Might not have known that he has to, actually :) [19:45:24] anything happening that should cause me to stop SWAT? [19:45:27] bstorm_: should I bring it up? [19:45:29] thcipriani: nope [19:45:29] Sure [19:45:31] k [19:45:43] bstorm_: ok, doing it [19:45:48] it is now up [19:46:02] cool thanks [19:46:03] RECOVERY - mysqld processes on labsdb1004 is OK: PROCS OK: 1 process with command name mysqld [19:46:07] thanks. Don't bother starting slave. clouddb1002 is already up to date with the master :) [19:46:23] bstorm_: if this host isn't important anymore, maybe we should disable notifications [19:46:25] I just have to get postgres migrated, and then we can turn off notifications for that old thing [19:46:32] ah ok :) [19:46:59] bstorm_: I am going back to the sofa then :) [19:47:05] Text me if you need me! [19:47:06] Enjoy! [19:47:08] Thanks [19:47:10] o/ [19:47:13] thcipriani, would you have time for https://gerrit.wikimedia.org/r/493425 too please? [19:49:13] Urbanecm: sure. you'll need to manually rebase though, gerrit can't seem to do it. [19:49:19] willdo [19:50:17] (03PS2) 10Urbanecm: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) [19:50:21] should be done [19:50:58] btw, anybody who knows how the dbname field works? I don't think "all" is a keyword [19:51:27] talking about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/493375/4/wmf-config/throttle.php [19:53:23] Urbanecm: yeah, we should remove that since it defaults to all. I assumed it was a group name. 
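The `dbname` handling being puzzled over here can be mirrored in a short sketch. This is a hedged Python rendering of the PHP check in throttle-analyze.php discussed just below, `isset( $options['dbname'] ) && !in_array( $wgDBname, (array)$options['dbname'] )` (our own function name, not MediaWiki code), showing why a value like `'all'` is treated as a literal wiki dbname and therefore matches nothing:

```python
def rule_applies(options, wg_dbname):
    """Hypothetical mirror of the PHP skip-condition: a throttle rule
    applies unless 'dbname' is set and the current wiki is not listed."""
    if 'dbname' not in options:
        # No dbname restriction: the rule applies on every wiki.
        return True
    dbnames = options['dbname']
    if not isinstance(dbnames, (list, tuple)):
        # Equivalent of PHP's (array) cast on a scalar value.
        dbnames = [dbnames]
    return wg_dbname in dbnames

# 'all' is not a keyword, just a dbname that no wiki has, so a rule
# written with dbname => 'all' silently applies nowhere:
rule_applies({'dbname': 'all'}, 'enwiki')   # False
rule_applies({}, 'enwiki')                  # True (defaults to all wikis)
```

This is why removing the bogus `'all'` value (rather than keeping it) restores the intended "applies everywhere" default.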
[19:53:37] I just looked into throttle-analyze.php [19:53:51] (03CR) 10jenkins-bot: Add new throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493375 (https://phabricator.wikimedia.org/T216998) (owner: 10Zoranzoki21) [19:53:51] jouncebot: next [19:53:51] In 0 hour(s) and 6 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T2000) [19:53:57] Relevant condition seems to be `isset( $options['dbname'] ) && !in_array( $wgDBname, (array)$options['dbname'] )` [19:54:01] take your time for the swat completion :) [19:54:27] that doesn't seem to understand groups at all [19:54:32] * thcipriani fixes [19:54:37] thanks thcipriani [19:55:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This needs to be released concurrently to a manual update of the php extension packages on deploy1001 and deploy2001" [puppet] - 10https://gerrit.wikimedia.org/r/493485 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [19:55:36] (03PS1) 10Thcipriani: Thottle Rules: remove 'all' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 [19:55:56] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 (owner: 10Thcipriani) [19:56:38] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:57:01] (03Merged) 10jenkins-bot: Thottle Rules: remove 'all' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 (owner: 10Thcipriani) [19:57:26] Urbanecm: nice catch [19:57:31] thanks [19:58:10] (03PS3) 10Thcipriani: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:58:16] (03CR) 10Thcipriani: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 
(https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:58:20] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [19:58:23] try that again [19:59:28] (03Merged) 10jenkins-bot: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190228T2000) [20:00:34] ehm, should I try something thcipriani ? [20:00:41] * Urbanecm is confused [20:00:52] Urbanecm: nope, just lining things up, syncing shortly [20:02:04] !log thcipriani@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:493425|Add throttle exception for Amnesty International Editathon]] [[gerrit:493496|Thottle Rules: remove "all"]] [[gerrit:493375|Add new throttle rules]] T216998 T217063 T217305 T217311 (duration: 00m 54s) [20:02:09] ^ Urbanecm all live now [20:02:17] great, thanks thcipriani [20:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:23] T216998: Throttle Exception for Amnesty International edit-a-thon on March 1st - https://phabricator.wikimedia.org/T216998 [20:02:24] T217311: Throttle Exception for Amnesty International edit-a-thon on March 8th - https://phabricator.wikimedia.org/T217311 [20:02:26] T217305: Lift IP cap on en.wiki for account creation for MoMA NYC - Saturday March 2, 2019 - https://phabricator.wikimedia.org/T217305 [20:02:27] T217063: Throttle Exception for WikiConNL Utrecht on 8 & 9 March 2019 - https://phabricator.wikimedia.org/T217063 [20:02:30] Urbanecm: thanks for your help with all those :) [20:02:41] always happy to help :) [20:02:52] hashar: you're clear for train [20:03:33] (I think European train was used this week) [20:04:48] I think the plan is to 
roll forward to all wikis now, since we're still not on group2 [20:05:12] cf: https://tools.wmflabs.org/versions/ [20:05:21] ah, okay then. thanks for explaining [20:05:41] (03CR) 10jenkins-bot: Thottle Rules: remove 'all' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493496 (owner: 10Thcipriani) [20:05:43] (03CR) 10jenkins-bot: Add throttle exception for Amnesty International Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493425 (https://phabricator.wikimedia.org/T217311) (owner: 10Urbanecm) [20:08:12] were all the issues fixed already? [20:08:35] well, I guess logstash will tell us in due time :) [20:09:28] thcipriani: thx :) [20:09:32] proceeding with the train [20:09:43] yeah promoting all :) [20:09:52] (03PS1) 10Hashar: all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 [20:09:54] (03CR) 10Hashar: [C: 03+2] all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 (owner: 10Hashar) [20:11:40] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 (owner: 10Hashar) [20:13:08] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.19 [20:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:01] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9914 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:15:09] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.8947 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:15:17] that sounds bad [20:15:19] looking at that [20:17:01] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493499 (owner: 10Hashar) [20:18:47] herron: last event received seems to have been around 19:48UTC if it helps [20:19:06] which also means I should probably not have proceeded with the promote since we are blind?
:( [20:20:14] thanks, yeah seeing that as well [20:20:21] trying logstash bounces right now [20:21:37] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2378 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:22:21] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:23:35] herron: and I would guess we lost the events haven't we? [20:23:47] depends on the transport [20:23:58] the kafka ones should still make their way into logstash [20:26:32] herron: do you have any idea what's going on yet? [20:27:09] not sure yet [20:27:45] it lines up with the time about when kafka was stopped on logstash1006, but that doesn’t explain why other log sources wouldn’t be flowing [20:28:11] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:28:27] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9946 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:30:04] herron: at least it seems I receive events from mediawiki now [20:33:30] for the last 15 min time period? [20:39:13] herron: hmm no cancel that sorry :( [20:39:19] I was looking at another source [20:39:25] ok, yeah seeing the same then [20:39:30] hey hashar train all good .19 out everywhere? [20:39:49] namely looking at mwlog1001.eqiad.wmnet:/srv/mw-log/exception.log [20:40:02] ottomata: I dont know :/ [20:40:56] oh no? [20:42:08] Class 'PHPlot' not found [20:42:09] really [20:42:21] one wonders how it is even possible to get those [20:45:19] herron: do we still use logstash's 'persistent queue' ? [20:46:39] cdanis: yes [20:46:56] there was a similar problem go.dog diagnosed over Christmas [20:47:00] all logstashen stuck around the same time [20:47:16] oh interesting. did disabling that help?
[20:47:26] I believe he just cleared it and restarted them [20:47:31] ok let’s try it [20:47:43] 2018-12-26 16:36:16 godog ok I tried clearing logstash's persistent queue on 1007 and that seems to have done the trick [20:47:45] 2018-12-26 16:37:00 godog !log clear logstash persistent queue /var/lib/logstash on logstash100[789] [20:48:52] thanks, fingers crossed [20:49:30] hah, wow [20:49:35] yeah logs are flowing again [20:49:36] similar symptoms as before: jvm threads dropped, cpu usage down, lots of messages in the receive buffer but not being processed, logstash process blocked on a futex [20:49:39] that’s terrible [20:49:46] yeah, it is [20:49:50] I have no idea how go.dog guessed this at the time [20:50:15] "making a note to followup on this tomorrow, we might not even want/need the persistent queue now anyways" [20:50:22] we should figure that out ;) [20:51:33] yeah I’m super inclined to disable this [20:52:46] !log cleared logstash persistent queue on logstash100[7-9] [20:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:17] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:53:18] thanks cdanis, glad you have a good memory [20:53:39] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:54:25] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.00127 https://grafana.wikimedia.org/dashboard/db/logstash [20:54:31] \o/ [20:54:42] mostly we have to thank go.dog and grep [20:54:58] strange coincidence that it’s around the same time as moving logstash1006 kafka to logstash1012 [20:55:08] I wonder if something about that caused logstash to go into this state [20:56:17] it is very interesting that it happened on all the hosts around the same time, then and now [20:56:38] i don't believe there was maintenance or anything 
happening on 12/26 when it happened before; most people were off [20:57:10] hmm yeah… and none of the kafka work today was reverted or changed to fix this [20:57:20] apparently some old event got flushed to logstash [20:57:30] but only partially maybe 30% of them [20:57:34] with some drop around 20:00 [20:57:43] only change was to mv /var/lib/elasticsearch/queue /var/lib/elasticsearch/queue.bad across the log collectors [20:58:01] hashar: yeah that’s thanks to kafka, and working to increase that percentage [20:58:03] hm [20:58:11] and apparently we dont receive any since 20:34 [20:58:16] hashar: that might also just be indexing lag [20:58:17] ( I use https://logstash.wikimedia.org/app/kibana#/dashboard/api-feature-usage-top-users?_g=h@a4b4479&_a=h@092bdd1 ) [20:58:29] they are log messages emitted for deprecated mediawiki API calls [20:58:58] and we have bots literally hammering the api. So that is a steady/constant source of messages ;) [20:59:24] so it seems still broken somehow? [20:59:41] (03PS3) 10Esanders: VE: Enable true section editing for mobile on labswiki & testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493482 (https://phabricator.wikimedia.org/T217365) [20:59:55] hashar: it’s catching up if you keep refreshing [21:00:23] these api-feature-usage logs are using kafka already and just need to catch up [21:00:36] ah [21:00:44] you can tell in the tags field, there is input-kafka-rsyslog-udp-localhost. the important part being input-kafka [21:00:47] no matter what is happening, I trust you ;)) [21:00:59] I have long lost track of changes that happened on the logging front! 
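The remedy rediscovered here boils down to moving the wedged persisted-queue directory aside (keeping it for inspection) and letting the service recreate an empty one on restart. A minimal Python sketch of just that directory rotation, under the assumption that the service is stopped around it; `rotate_queue` is our own name, not an existing tool, and the real paths (`/var/lib/logstash`, `/var/lib/elasticsearch/queue`) are as quoted in the log above:

```python
import shutil
import time
from pathlib import Path

def rotate_queue(qdir):
    """Move a wedged queue directory to <name>.bad.<epoch> and recreate
    an empty directory for the service to repopulate after restart."""
    qdir = Path(qdir)
    backup = qdir.with_name(f"{qdir.name}.bad.{int(time.time())}")
    shutil.move(str(qdir), str(backup))  # preserve for later inspection
    qdir.mkdir()                          # service refills this on start
    return backup
```

On a collector this would be wrapped with service control (stop logstash, rotate, start logstash), which is deliberately left out of the sketch.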
[21:01:26] haha [21:01:50] you know [21:01:55] I still have punch card at home [21:02:33] ((( meanwhile train looks fine ))) [21:02:37] i had 2 weeks of punchcard programming in high school because our teacher thought it was funny [21:02:58] he also would intentionally mention 1 time only 'you should number your cards on the corners' and then pick someone who didnt and knock over their shit [21:03:04] that teacher was an asshole. [21:03:09] well [21:03:14] it was actually a good lesson [21:03:18] * apergos gets their 'old guy on the porch' hat on [21:03:22] oh it was, but doesnt mean he wasnt an asshole! [21:03:29] ;))))) [21:03:30] i knew in advance to do that shit [21:03:34] someone had warned me ;D [21:03:37] I ran programs with punch cards because that's how we had to submit jobs [21:03:42] :-P :-P :-P [21:03:57] one typo and you went to fix it, submit the job again and wait another half hour... [21:03:58] punchard programming, then onto turbo pascal! woooooo [21:03:59] Ariel is the wisest of us all [21:04:11] indeed, everyone must stay off ariels lawn [21:04:13] or at least the oldest curmudgeonist [21:04:24] damn straight [21:05:06] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. 
- https://phabricator.wikimedia.org/T217359 (10mobrovac) [21:05:24] https://www.etsy.com/listing/555343740/vintage-ibm-punch-card-wreath we learned to make this crap out of them too :-P [21:07:09] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3418 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [21:07:59] oh no [21:08:08] oh yes [21:08:12] they hung up again [21:09:17] !log disabling logstash persisted queue [21:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:40] disabled live for now, working on a patch next [21:12:43] mediawiki seems to be saturating logstash [21:13:21] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04888 https://grafana.wikimedia.org/dashboard/db/logstash [21:15:26] I can rollback the train [21:15:54] oh [21:15:57] no, it seems fine [21:16:05] also from https://grafana.wikimedia.org/d/000000102/production-logging?orgId=1&from=1551378582537&to=1551388309667 [21:16:20] we had a bunch of EventBus errors since 19:40 [21:16:30] looking [21:16:38] hm, that is interesting [21:16:45] which might or might not be related to: Synchronized wmf-config/CommonSettings.php: SWAT [[gerrit:492770|Remove legacy eventBus config settings.]] (duration: 00m 53s) [21:17:03] ya that should have been a no-op, and things were functional on mwdebug1002 when i checked [21:17:14] I guess something reports to statsd the number of messages for each mediawiki bucket [21:17:22] (it is black magic to me really) [21:18:13] what is that? [21:18:16] mw logs by channel...
ok logstash [21:18:38] yeah [21:18:39] sorry [21:18:40] Unable to deliver all events: (curl error: 7) Couldn't connect to server [21:19:08] (03PS11) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) [21:19:34] https://logstash.wikimedia.org/goto/df640990bc4171aa955a6978c6479a64 for a search of "EventBus" around 19:40 [21:20:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [21:20:14] thank you my logstash fu is poor [21:20:46] that looks real bad [21:21:07] should we revert the mediawiki config change? [21:21:15] yes [21:21:23] wow [21:21:27] yikes. [21:22:24] hmm https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/492770/ [21:22:26] these are all from jobrunner [21:22:28] it has a Depends-On: [21:22:43] ah yeah on the extension [21:22:58] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/492023/ [21:23:07] which is in 1.33.0-wmf.19 [21:23:14] but not in 1.33.0-wmf.18 :/ [21:23:16] petr was wrong with that depends on anyway, it just depended on a config change [21:23:24] okkk [21:23:30] but, in either case [21:23:36] the .19 is out now no? [21:23:44] so if it was that, the errors would stop? [21:24:02] I have no clue ;D [21:25:14] let's revert. [21:25:16] i dunno what is happening [21:25:27] (03PS1) 10Ottomata: Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 [21:25:52] (03CR) 10Hashar: [C: 03+2] Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 (owner: 10Ottomata) [21:26:48] (03Merged) 10jenkins-bot: Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 (owner: 10Ottomata) [21:27:12] ok hashar shall I sync?
[21:27:20] ottomata: doing so ;) [21:27:22] ok thanks [21:27:24] then hmm [21:27:31] gotta look at something to confirm :/ [21:27:40] the devil is at "something" ;) [21:28:02] !log hashar@deploy1001 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 00m 47s) [21:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:43] OH [21:28:44] hashar: [21:28:55] the logstash you sent was from back in time [21:28:56] not current [21:29:12] it isn't erroring now, is it? [21:29:28] that's why i couldn't find when i was looking [21:29:30] (03CR) 10jenkins-bot: Revert "Remove legacy eventBus config settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493591 (owner: 10Ottomata) [21:29:34] I have one with timestamp=2019-02-28T23:13:00+00:00 [21:29:54] but @timestamp 2019-02-28T20:13:00 [21:30:21] time is hard :( [21:31:25] when did the train finish everywhere? [21:31:46] 20:13 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.19 [21:31:52] right. [21:32:02] rats, ok then i was totally wrong about that config change, it did depend...not sure how [21:32:11] especially since it worked for the smoke test on mwdebug1002 [21:32:26] must be something in the job submission part I don't understand [21:32:35] I dont know really :/ [21:32:46] for mediawiki.EventBus log bucket I have https://logstash.wikimedia.org/goto/aa1086cca43928785120653a1a100c6f [21:32:59] with bunch of errors from 19:40 till 19:56 [21:33:06] then I guess logstash servers dropped events [21:33:34] the logstash outage began around 20:06 or so [21:33:44] excuse me, about 20:09 [21:34:04] the latest error for this I see is the 20:13 one [21:34:17] then there are new EventBus errors from 20:10 to 20:13 roughly [21:34:51] oh, of course, mwdebug would have had .19 when I tested during swat [21:35:05] but group2 wouldn't have, which I wouldn't have been testing on.
[21:35:12] yeah [21:35:15] hashar: i dunno, in either case, let's leave this reverted until next week. [21:35:20] its almost end of day here [21:35:20] if that was the cause of the issue ;) [21:35:21] man [21:35:21] can see logstash start getting bogged down at 19:40 before the udp loss starts here https://grafana.wikimedia.org/d/000000564/logstash-herron-wip?orgId=1&from=now-3h&to=now [21:35:28] ok i have to figure out what to do now [21:35:31] will file a ticket [21:35:48] and to find out what happend to logstash :/ [21:35:55] oh interesting herron, I wonder if that correlates with a growth in the size of the persistent queue [21:36:33] logstash started having udp packet loss apparently right when wmf.19 reached all wikis :/ [21:37:00] yeah the node queue events graph is pretty interesting [21:37:30] herron: a few interesting things on https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=logstash1007&var-datasource=eqiad%20prometheus%2Fops&var-cluster=logstash [21:37:34] the network TX rate [21:37:38] and at the bottom, 'procs blocked' [21:38:13] that's apparently just number of processes blocked on i/o [21:38:15] interesting [21:38:34] could it be a queue starting to fill up at 19:40 and overflowing at 20:10? [21:38:53] that’s what it looks like to me [21:38:58] and the new wmf.19 being deployed at 20:10 being unrelated? 
[21:39:11] I suspect the mw deploy is unrelated [21:39:24] logstash has failed this way in the past [21:39:45] cdanis: well what’s interesting is the mw error logs correlate with the beginning of the queue spike [21:39:54] https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=4&fullscreen&edit&orgId=1&from=now-3h&to=now [21:39:54] and it processes way more events since 20:50 ( https://grafana.wikimedia.org/d/000000564/logstash-herron-wip?panelId=11&fullscreen&orgId=1&from=now-3h&to=now ) [21:40:16] hm, yeah, that is true herron [21:41:45] and I dont get why the rate of mediawiki logs is twice what we had before ( https://grafana.wikimedia.org/d/000000102/production-logging?orgId=1&from=1551378582537&to=1551388309667&panelId=21&fullscreen ) [21:41:52] maybe the queue is flushing again since 20:50 [21:42:10] i'm not sure where that graph comes from [21:42:17] not sure either :/ [21:42:38] herron: are all of these mw logstash errors in kafka now? [21:42:44] I think it was a sprint releng did years ago so we could get a quick glance at the log spam [21:46:26] think so yeah, but don’t have a great way off hand to check how far logstash is behind kafka-logging.
that would be super useful [21:46:51] herron: y'all should deploy a burrow checker for logstash [21:46:57] it just checks and reports consumer lag [21:47:03] (03PS1) 10Bstorm: wikilabels: stage the postgres roles for virtualizing the database [puppet] - 10https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) [21:47:57] (03CR) 10jerkins-bot: [V: 04-1] wikilabels: stage the postgres roles for virtualizing the database [puppet] - 10https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) (owner: 10Bstorm) [21:49:48] (03PS2) 10Bstorm: wikilabels: stage the postgres roles for virtualizing the database [puppet] - 10https://gerrit.wikimedia.org/r/493608 (https://phabricator.wikimedia.org/T193264) [21:50:06] ottomata: yeah that sounds perfect [21:51:43] (03PS1) 10Herron: logstash: disable persisted queue [puppet] - 10https://gerrit.wikimedia.org/r/493610 (https://phabricator.wikimedia.org/T200960) [21:51:43] bblack: time to remove "temporary" test? :) https://gerrit.wikimedia.org/r/c/operations/dns/+/161426/1/templates/wikipedia.org [21:51:45] herron: i'm going to try to save some of this event data [21:52:00] from the logstash topic [21:52:04] since the event data was emitted there [21:52:11] alright [21:52:11] you ok if I consume and grep from udp_localhost-err ?
[21:52:18] i'm goign to do it from weblog1001 [21:52:22] and save the data locally there [21:52:26] and then figure out what to do [21:52:43] sure yeah, thanks for the heads up [21:53:00] oh, i think i can't connect from there [21:53:07] i'm going to do it from logstash1010 then [21:53:12] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [21:53:13] unless you know of a better place [21:53:20] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) 05Open→03Resolved Everything has been moved smoothly, thanks! [21:53:39] that’s fine too, but the connection from weblog1001 is timing out? [21:54:34] i think ferm rules ya [21:55:12] oh shouldn't be [21:55:13] dunno why [21:55:58] ok, yeah was wondering the same since a wide range of hosts are producing to it too [21:56:16] that was fast herron [21:56:20] ok i got 54K events [21:56:23] from that [21:56:46] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) p:05Triage→03Normal [21:57:17] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [21:57:17] oh no maybe more... 
[21:57:22] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) [21:57:30] firehose activated [21:57:32] so beside that logstash explosion, I am tempted to say the train went fine [21:57:34] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [21:57:35] 269K [21:57:39] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [21:57:57] but I am assuming logstash manages to process enough even to have a good view of mediawiki health state :/ [21:58:07] 10Operations, 10ops-eqiad: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [21:58:16] AFAIK there aren't good indicators for logstash indexing lag [21:58:19] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10ayounsi) [21:59:01] !log disable asw2-a5 <> asw-a link - T217383 [21:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:04] T217383: Decommission asw2-c5-eqiad - https://phabricator.wikimedia.org/T217383 [22:00:45] ottomata: do you have an example of burrow configured for main, etc. to check out? [22:00:47] 10Operations, 10ops-eqiad: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [22:01:42] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10aaron) ChronologyProtector positions should be applie... 
[22:01:58] cdanis: for logstash lag, assuming it inserts a timestamp when it processes an event and the original timestamp is kept; We could potentially do some query and look at the delta [22:02:01] wild guess really :( [22:02:19] i think in puppet somewhere [22:02:55] I can dig around too if it’s not a quick link or whatever [22:03:07] anyway [22:03:29] we have logs since maybe 21:10 UTC. Seems MediaWiki 1.33.0-wmf.19 is behaving properly [22:03:33] so I am closing the train [22:03:56] herron: https://github.com/wikimedia/puppet/blob/fae4f09c6b6215b3c91946120fba930063d15963/hieradata/role/common/kafka/monitoring.yaml [22:03:58] !log MediaWiki 1.33.0-wmf.19 deployed on all wikis # T206673 [22:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:01] T206673: 1.33.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T206673 [22:04:22] ottomata: sweet thanks [22:04:25] oh you have logging in there [22:04:55] (03PS12) 10DCausse: [WIP] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) [22:04:55] yeah… but not alerting? [22:05:08] i think lag metrics should be in prometheus then.. [22:05:13] maybe not alerting [22:05:52] herron: [22:05:52] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=logstash&from=1551369945820&to=1551391545820 [22:05:53] ? [22:06:32] heyyy look at that [22:06:52] bits overflow ?
;D [22:06:56] sweet, just need some alerts then [22:07:06] ya also [22:07:07] https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/profile/manifests/kafka/burrow.pp [22:07:12] but elukey would be more helpful here than me [22:07:15] i'm just looking around [22:07:40] (03PS1) 10Ayounsi: Remove asw2-a5-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/493613 (https://phabricator.wikimedia.org/T217383) [22:08:23] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (10ayounsi) [22:08:47] thanks it’s super helpful [22:09:47] ottomata: would be nice to figure out whether the Swat change at 19:40 actually required wmf.19 which had not been deployed yet to all wikis [22:09:59] hashar: just made https://phabricator.wikimedia.org/T217385 [22:10:27] hashar: indeed [22:10:31] I was looking at https://grafana.wikimedia.org/d/000000201/eventbus?from=1551369945820&to=1551391545820&orgId=1&panelId=1&fullscreen with "show deployments" ticked [22:10:51] and the # of requests / seconds drop when the mw-config change got deployed [22:11:00] yeah makes sense [22:11:03] but the rate goes up again when wmf.19 got deployed everywhere [22:11:15] as to why it would have caused logstash to explode .. I cant tell [22:11:41] hashar i would assume those are unrelated [22:11:45] well [22:11:51] i guess it would result in that many more logstash errors [22:12:22] during that period the 270K errors were produced to logstash [22:12:58] not that much more, that’d be around the order of +100-200 msgs per second [22:13:01] in the error topic [22:14:02] mobrovac: not sure if you are here [22:14:12] but i am going to replay these failed messages [22:14:23] i think it's the right thing to do.
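The burrow-style lag check being proposed is, per partition, just the broker's log-end offset minus the consumer group's committed offset. A sketch of that arithmetic, assuming the two offset maps have already been fetched elsewhere (by Burrow, a Kafka client, or the Prometheus exporter behind the dashboard linked above):

```python
def consumer_lag(end_offsets, committed):
    """Per-partition and total lag for one consumer group.

    end_offsets / committed: {(topic, partition): offset} maps; a
    partition with no committed offset is treated as fully behind."""
    per_partition = {
        tp: end_offsets[tp] - committed.get(tp, 0)
        for tp in end_offsets
    }
    return per_partition, sum(per_partition.values())
```

Alerting on the total (or the max per partition) crossing a threshold would give the missing "how far is logstash behind kafka-logging" signal.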
[22:14:52] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/14936/" [puppet] - https://gerrit.wikimedia.org/r/493613 (https://phabricator.wikimedia.org/T217383) (owner: Ayounsi)
[22:15:02] (PS2) Ayounsi: Remove asw2-a5-eqiad from monitoring [puppet] - https://gerrit.wikimedia.org/r/493613 (https://phabricator.wikimedia.org/T217383)
[22:15:25] Pchelolo: you gone?
[22:16:12] not yet ottomata
[22:16:15] ah k
[22:16:21] (PS1) Kosta Harlan: GrowthExperiments: Enable help panel for user and user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664)
[22:16:36] it seems you were very right about that config change depending on an eventbus extension patch
[22:16:42] https://phabricator.wikimedia.org/T217385
[22:16:48] not sure how or why
[22:17:06] Pchelolo: do you think i should replay those events? I've captured them.
[22:17:43] ottomata: thank you for the task. I have added a screenshot on it
[22:17:57] thanks
[22:18:08] and I am going to sleep ;)
[22:18:32] thank you hashar
[22:18:34] (also if you can find out why logstash exploded it would be nice ;) )
[22:18:34] sorry for the trouble
[22:18:42] (PS1) Kosta Harlan: GrowthExperiments: Enable help panel for user and user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493616 (https://phabricator.wikimedia.org/T215664)
[22:18:53] besides that, the train for wmf.19 itself went fine!
[22:19:05] ottomata: up to you, if it's easy to do
[22:19:23] Pchelolo: i can script it for sure. it is the right thing to do, right?
[22:19:37] otherwise we're missing e.g. 4277 revision-create events
[22:19:42] (PS2) Kosta Harlan: (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664)
[22:19:43] and lots of job queue stuff?
[22:20:12] the job queue is more important than rev-create
[22:20:32] ya
[22:20:36] ok, then will replay...
[22:23:33] PROBLEM - Check correctness of the icinga configuration on icinga2001 is CRITICAL: Icinga configuration contains errors
[22:23:51] I’ll look at that icinga alert
[22:24:22] Operations, ops-eqiad, Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (ayounsi) a: ayounsi→Cmjohnson
[22:25:19] Operations, ops-eqiad, Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (ayounsi)
[22:25:24] (PS1) Reedy: Swap unsetting $wgSpecialPage for DisabledSpecialPage::getCallback( '' ) [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376)
[22:25:41] jouncebot: now
[22:25:41] No deployments scheduled for the next 1 hour(s) and 34 minute(s)
[22:25:43] jouncebot: next
[22:25:43] In 1 hour(s) and 34 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190301T0000)
[22:25:57] (CR) Reedy: [C: +2] "YOLO (will need testing on mwdebug)" [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:26:38] herron: I pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/493613 ~10min ago, dunno if related
[22:27:05] (Merged) jenkins-bot: Swap unsetting $wgSpecialPage for DisabledSpecialPage::getCallback( '' ) [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:27:45] XioNoX: yeah looks like it is related
[22:27:51] https://www.irccloud.com/pastebin/Mbz884sd/
[22:27:57] hum
[22:28:22] herron: so probably because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/493613/2/hieradata/common/monitoring.yaml
[22:28:53] that switch is gone though
[22:29:14] (I'll re-add the statement for now)
[22:29:30] (CR) jenkins-bot: Swap unsetting $wgSpecialPage for DisabledSpecialPage::getCallback( '' ) [mediawiki-config] - https://gerrit.wikimedia.org/r/493617 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:29:36] !log replaying events from mediawiki eventbus config outage - T217385
[22:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:40] T217385: EventBus mediawiki outage 2019-02-28 - https://phabricator.wikimedia.org/T217385
[22:29:46] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Disable some translate special pages again T217376 (duration: 00m 47s)
[22:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:50] T217376: PHP fatal error: Class 'PHPlot' not found - https://phabricator.wikimedia.org/T217376
[22:30:11] ok, wonder if these will clear too after running puppet on the affected hosts?
[22:31:08] herron: yeah I think we should remove them from puppet before removing the switch from puppet
[22:31:13] robh: ^
[22:31:15] (PS1) Reedy: Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376)
[22:31:27] (CR) Reedy: [C: +2] Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:31:40] ok, yeah was going to say I can’t reach these hosts haha
[22:31:59] those are decoms right?
[22:32:24] (Merged) jenkins-bot: Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:32:38] worth double checking, but if so, clearing them from puppetdb and re-running puppet on icinga2001 should clear this
[22:33:23] so is this error happening on icinga during puppet runs?
[22:33:35] just not sure of the history, trying to sort the backlog with the bot traffic isn't easy to parse
[22:33:39] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove old translate config (duration: 00m 46s)
[22:33:40] herron: ok, can you try to clear them from puppetdb?
[22:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:53] ah yeah, icinga just recently threw a warning that the config has errors
[22:33:57] the icinga config itself
[22:33:59] yeah
[22:34:02] checking that file now
[22:34:05] and removing decom hosts
[22:34:11] that were in asw2-a5
[22:34:43] ok, removing how?
[22:34:57] well, i was going to just grep the entire puppet repo
[22:34:59] from site.pp ?
[22:35:01] and start removing all references
[22:35:05] since they are no longer connected to anything
[22:35:12] and referencing T208584
[22:35:12] T208584: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584
[22:35:19] which is the generic decom task for eqiad cp systems
[22:35:21] (CR) Catrope: [C: +2] (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664) (owner: Kosta Harlan)
[22:35:25] do you know if they were already deactivated in puppet?
[22:35:40] i'm still trying to find a more accurate ticket
[22:35:45] to log this on, so standby =]
[22:35:54] ie: a task with a per-host entry and checklist
[22:36:04] XioNoX: sure I’ll give that a shot
[22:36:29] (Merged) jenkins-bot: (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664) (owner: Kosta Harlan)
[22:36:35] just whoever does something will have to edit this task
[22:36:39] and check off what you did
[22:36:39] I'd like to avoid having old data in Icinga, but first I'd like to not break Icinga :)
[22:36:50] because everyone just fixing something and not updating the task won't work ;]
[22:37:01] robh: https://phabricator.wikimedia.org/T208586 and https://phabricator.wikimedia.org/T208584
[22:37:09] for the decom
[22:37:16] yeah im on 84
[22:37:18] it lacks a checklist
[22:37:20] so it's not good enough
[22:37:21] https://phabricator.wikimedia.org/T217383 for the switch itself
[22:37:22] im editing it to add it
[22:37:39] i mean, i just repeat over and over that a decom task has to use the checklist
[22:37:43] so if it lacks it, it's not good enough.
[22:37:49] Operations, MediaWiki-Containers: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (hashar) And @thcipriani did https://tools.wmflabs.org/dockerregistry/
[22:40:51] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH)
[22:40:52] ok, https://phabricator.wikimedia.org/T208584 is updated
[22:41:04] so im going to just run the decom script on all of those, which will remove from puppet and icinga monitoring
[22:41:15] (CR) jenkins-bot: Remove FirstSteps SpecialPage disabling completely [mediawiki-config] - https://gerrit.wikimedia.org/r/493619 (https://phabricator.wikimedia.org/T217376) (owner: Reedy)
[22:41:17] (CR) jenkins-bot: (Betalabs) GrowthExperiments: Enable help for user / user talk NS [mediawiki-config] - https://gerrit.wikimedia.org/r/493615 (https://phabricator.wikimedia.org/T215664) (owner: Kosta Harlan)
[22:41:19] if anyone wants to do another step, they certainly can, just use checkboxes =]
[22:41:26] cleaning puppetdb looks to do the trick for one host, so proceeding with the rest in that error msg
[22:41:44] puppet takes forever and a day to run on icinga, so it will be a few, but that alert should clear shortly
[22:41:52] herron: are you using the script to do that?
[22:41:55] if not, you can stop
[22:41:59] i'll run the script on all of them
[22:42:01] and it removes it
[22:42:13] ie: we try to not do manual steps if we can help it
[22:42:27] but the script should handle it being manually removed; if not, it'll need patching
[22:42:35] " - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)"
[22:43:10] * robh isn't doing anything for fear of two people doing the same thing breaking something
[22:43:19] i'll wait until herron lets me know what's up
[22:43:46] oh ok, yeah go for it. I deactivated the nodes from the error above but submitting the command again isn’t harmful
[22:44:18] Operations, ops-eqiad, Patch-For-Review: Decommission asw2-a5-eqiad - https://phabricator.wikimedia.org/T217383 (ayounsi)
[22:46:00] i'll be back in a bit to check on the replay
[22:47:25] (CR) Ayounsi: [C: +1] "LGTM, not sure what's better between using the subfolder or common.yaml" [labs/private] - https://gerrit.wikimedia.org/r/493084 (owner: CRusnov)
[22:48:18] RECOVERY - Check correctness of the icinga configuration on icinga2001 is OK: Icinga configuration is correct
[22:48:27] herron: https://gerrit.wikimedia.org/r/c/operations/puppet/+/480537 <3
[22:48:51] haha yes! one small step
[22:49:23] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH)
[22:50:03] (PS15) Eevans: Initial configuration for session storage service [puppet] - https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883)
[22:51:11] herron: from puppet, we can probably list all the servers that have IPv6 with autoconfiguration, and apply that statement to them as well, so they have a proper v6
[22:53:26] it makes me so nervous to let puppet perform sweeping network interface changes on live systems
[22:53:33] robh: ^ in case you didn't see herron's answer a few lines above
[22:53:54] ?
[22:54:00] i saw that it's ok to double-run commands?
[22:54:03] or did i miss something else?
[22:54:14] robh: "oh ok, yeah go for it."
[22:54:22] yeah, no worries
[22:54:33] as soon as the icinga error cleared i put it on the back burner and updated the task for every cp on that task
[22:54:48] going to just start working through it systematically since my understanding is icinga is ok again?
[22:55:47] robh: yeah the error recovered
[22:55:52] =]
[22:55:54] yep icinga is happy, thanks robh
[22:56:51] thanks you two!
[22:57:09] i did nothing ;D
[22:57:21] but you did it so well!
[22:57:44] robh: once those servers are decom, it will unblock https://phabricator.wikimedia.org/T208734 too
[23:02:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:03:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:03:28] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational
[23:04:00] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga2001 is CRITICAL: 59.17 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:04:24] what
[23:06:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:07:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:07:51] bblack: ^ looking into that, so far no idea of what's going on, network looks fine
[23:07:52] !log decom cp1045-cp1055, all are role spare but may icinga alert for ping
[23:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:18] things seem to be back to normal
[23:08:38] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga2001 is OK: (C)60 le (W)70 le 71.24 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:13:54] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1045.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:06] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1046.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:12] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (jcrespo) @Marostegui I've chosen not to reimage the server because this is right now a backup testing one, I think it is ok if it currently doesn't have the right enwiki data....
[23:14:23] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1047.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:35] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1048.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:14:46] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1049.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:03] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1050.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:16] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1051.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:29] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1052.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:40] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1053.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:15:53] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1054.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:16:06] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1055.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:19:06] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH) Please note cp1045-cp1055 are all on asw-c-eqiad as their active switch, but ports were also reserved on asw2-c-eqiad for migration (if they were not decommissioned befor...
[23:22:14] so much spam
[23:22:15] heh
[23:22:22] that's what happens when scripts log per host ;D
[23:22:26] XioNoX: were the caches ever repooled from the row A work?
[23:22:52] bblack: ah no!
[23:23:04] is that the cause?
[23:23:13] not that that should cause any fallout, but just noticed while verifying some things
[23:23:37] I forgot they need to be manually repooled
[23:24:02] doing it now
[23:24:54] well, verifying everything looks ok, then doing it now :)
[23:25:34] bblack: how do you repool? https://wikitech.wikimedia.org/wiki/Service_restarts#Cache_proxies_(varnish)_(cp) ?
[23:27:16] "pool-once" should probably go away, it has proven to be a bad idea :)
[23:27:23] !log bblack@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp107[678]\.eqiad\.wmnet
[23:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:44] I ran: confctl select 'name=cp107[678]\.eqiad\.wmnet' set/pooled=yes
[23:27:49] on puppetmaster1001 as root
[23:28:03] (it autologs as shown above)
[23:28:15] ah etcd!
[23:28:16] (PS1) RobH: decommission old eqiad cache entries [puppet] - https://gerrit.wikimedia.org/r/493626 (https://phabricator.wikimedia.org/T208584)
[23:29:23] Re: the availability spike earlier, it does get confirmed on other graphs as a varnish 503 spike
[23:29:49] (CR) RobH: [C: +2] decommission old eqiad cache entries [puppet] - https://gerrit.wikimedia.org/r/493626 (https://phabricator.wikimedia.org/T208584) (owner: RobH)
[23:30:08] all sites, cache_text
[23:30:15] so possibly a site<->site connectivity issue?
[23:30:33] ah, only text
[23:30:34] or even an internal problem in eqiad, it affected direct eqiad requests too
[23:31:30] !log pre-configure asw-a1 ports on asw2-a1-eqiad - T187960
[23:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:34] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[23:31:34] so with what little I know so far: could be an issue with any of the services way behind cache_text, or lvs1006/16 (which sits between varnish and most of those services), or network stuff
[23:33:13] ok
[23:34:10] (CR) Bstorm: "The two repos are not 100% identical. We can always revert, though :-D" [puppet] - https://gerrit.wikimedia.org/r/493451 (https://phabricator.wikimedia.org/T216712) (owner: BryanDavis)
[23:34:30] it looks a lot like just a varnish@eqiad problem though, so far
[23:34:41] (PS1) RobH: decom cp10[45-55] production dns entries [dns] - https://gerrit.wikimedia.org/r/493627 (https://phabricator.wikimedia.org/T208584)
[23:34:48] the same kind of thing we've seen with a misbehaving backend, although again network things could look similar
[23:36:25] (CR) RobH: [C: +2] decom cp10[45-55] production dns entries [dns] - https://gerrit.wikimedia.org/r/493627 (https://phabricator.wikimedia.org/T208584) (owner: RobH)
[23:36:42] Went through the network logs and can't see anything out of the ordinary, but hard to tell, yeah
[23:36:57] yeah seems like cp1075 is at fault
[23:38:53] (PS4) Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T215684)
[23:39:00] (PS5) Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T215684)
[23:39:35] bblack: saw that in cp1075's logs?
[23:40:10] in the weblog1001 5xx logs
[23:40:23] ok
[23:40:36] I'm trying to correlate anything strange happening on cp1075 itself so we don't just end up saying "varnish misbehaves sometimes" again :P
[23:41:49] (PS6) Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T217276)
[23:41:55] but really, there's not much that correlates
[23:42:09] (PS1) Smalyshev: Run WikibaseCirrusSearch code for search on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/493629 (https://phabricator.wikimedia.org/T217276)
[23:42:15] Operations, ops-eqiad, Traffic, decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (RobH)
[23:43:36] would be useful to have a checklist or list of links to check when this issue happens
[23:43:45] everything!
[23:44:01] even if it's "try to find a pattern in weblog1001's log"
[23:44:10] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1
[23:44:37] ^ I tend to look here often/first, and select status code type "5" and view the bottom graph and drill around sites/caches
[23:44:46] yeah that one is useful!
[23:44:58] !log pre-configure asw-a2 ports on asw2-a2-eqiad - T187960
[23:44:59] it's similar to what the linked nginx availability graph shows, it's just older and I trust it more, but it's also clunkier
[23:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:02] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[23:46:14] given the time range was very close on ~23:00, on weblog1001 this is what gave up the correlation to cp1075 specifically (anything else, like lvs or applayer, would've been more spread around the caches, in most cases):
[23:46:20] bblack@weblog1001:/srv/log/webrequest$ grep 2019-02-28T23:00 5xx.json |grep -w '"503"'|jq .x_cache|cut -d, -f1|sort|uniq -c|sort -rn|head -10
[23:46:23] 7985 "cp1075 int
[23:46:25] 2 "cp1085 int
[23:46:28] 1 "cp1089 int
[23:46:38] but I got there in steps, obviously
[23:47:28] the initial grep will give you the lines from 5xx.json that happened during that minute, then grepping (somewhat imprecisely, again) for 503s, then jq pulls out the "X-Cache" header, and cut takes the first entry in the comma-separated list...
[23:48:08] I probably piped the output of most of those things into a final "| head" as I was building up, just to see the data as it pares down and transforms
[23:48:40] and then once I liked it after cut, I tacked on " |sort|uniq -c|sort -rn|head -10" which will glomp them all up by hitcount and give you the top 10 (in this case, there were only 3)
[23:48:42] makes sense
[23:49:06] there's probably a vastly simpler way that doesn't involve 4 commands heh
[23:49:35] but I've learned some dumb finger-memory patterns, and that one decomposes well if you want to step back a little in it
[23:49:58] yeah, writing that down for future use
[23:51:41] bblack: unrelated, ok to replace the pool-once with the etcd depool in https://wikitech.wikimedia.org/wiki/Service_restarts#Cache_proxies_(varnish)_(cp) ?
[23:52:26] well that whole section is slightly confusing, but yes, documentation of pool-once should probably go away
[23:52:51] what still works and we still like: they will auto-depool themselves on clean shutdown, and you can always depool or pool as root on a running machine by typing "pool" or "depool" on it.
[23:53:25] the confctl command was just easier for me than sshing to 3 different machines to run "pool", since I was already looking in confctl just to confirm what was already in what state
[23:54:35] the pool/depool commands are not documented either, will add
[23:54:48] the pool-once saga is that we figured since we auto-depool on shutdown, maybe it would make life easier to be able to auto-repool when it comes back for simple cases. But we knew it should be explicit, so if you touch that pool-once file, when it powers up it will self-repool and delete the file.
[23:55:13] but it turns out that's a horrible idea, because when you decide to poweroff/reboot a machine, you can't predict the future and don't know if it will ever be back
[23:55:31] so we've had cases where we thought it was a quick reboot, but the hardware is borked and it never comes back.
[23:56:11] then we open a hardware ticket, and some hardware part gets replaced, and maybe a week or three later someone finally boots it back up. But having been offline a few weeks, it's definitely not in a state where we want it self-repooling, which is what we told it to do when we shut it down for the "quick" reboot.
[23:56:26] because it's missing all kinds of updates and changes and ocsp state, etc. during the long downtime.
[23:57:26] so really that whole pool-once mechanism should be killed, just haven't gotten around to it
[23:57:37] yeah, self-repool should at least do a self-check first
[23:58:03] the problem is there's no great way to structure that. when it first boots, who knows what it might be missing that happened during the time it was away.
[23:58:25] maybe someone applied an extremely critical package update for a security bugfix to all the caches via cumin but this one missed it, or whatever.
[23:59:18] or if down for > X hours then don't self-repool, and alert on icinga that a cp server is depooled
[23:59:27] it's kind of ok to auto-pool around a quick reboot, but in the general case it's probably better to just reinstall if it's been out of the loop for a long time.
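For the record, bblack's four-command weblog1001 pipeline above (grep one minute of 5xx.json, keep the 503s, pull X-Cache with jq, cut the first hop, then sort | uniq -c | sort -rn | head) decomposes into a few lines of Python. This is only a sketch: the `dt`, `http_status`, and `x_cache` field names are assumptions standing in for whatever keys the real 5xx.json records carry.

```python
import json
from collections import Counter

def top_503_caches(log_lines, minute_prefix="2019-02-28T23:00", n=10):
    """Rough Python equivalent of:
    grep <minute> 5xx.json | grep -w '"503"' | jq .x_cache | cut -d, -f1
                           | sort | uniq -c | sort -rn | head -10"""
    counts = Counter()
    for line in log_lines:
        if minute_prefix not in line:  # the "grep <minute>" step
            continue
        rec = json.loads(line)
        if rec.get("http_status") != "503":  # the grep -w '"503"' step
            continue
        # X-Cache lists cache hops comma-separated; keep only the first
        # (backend-most) entry, like cut -d, -f1
        counts[rec["x_cache"].split(",")[0]] += 1
    return counts.most_common(n)  # sort | uniq -c | sort -rn | head

# Hypothetical sample records; real 5xx.json field names may differ.
sample = [
    '{"dt": "2019-02-28T23:00:01", "http_status": "503", "x_cache": "cp1075 int, cp1089 miss"}',
    '{"dt": "2019-02-28T23:00:02", "http_status": "503", "x_cache": "cp1075 int, cp1085 miss"}',
    '{"dt": "2019-02-28T23:00:03", "http_status": "200", "x_cache": "cp1085 hit, cp1089 hit"}',
]
print(top_503_caches(sample))  # [('cp1075 int', 2)]
```

Like the shell version, it groups 503s by the first X-Cache hop, which is the step that singled out cp1075 in the incident above.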