[01:38:43] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:04:52] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:18:22] <icinga-wm>	 PROBLEM - HHVM rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:22] <icinga-wm>	 RECOVERY - HHVM rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 80626 bytes in 0.157 second response time
[02:47:03] <icinga-wm>	 PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:02:22] <icinga-wm>	 RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:12:03] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:18:32] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:21:22] <onimisionipe>	 !log restarting inplace reindexing of enwiki and viwiki at codfw - T204362
[03:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:28] <stashbot>	 T204362: Resolve elasticsearch shard size alert by doing an in place reindex - https://phabricator.wikimedia.org/T204362
[03:28:22] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 801.77 seconds
[04:08:32] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.29 seconds
[05:07:23] <marostegui>	 !log Deploy schema change on s1 codfw master - T203709
[05:07:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:29] <stashbot>	 T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[05:12:54] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699
[05:13:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:15:14] <wikibugs>	 (PS2) Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699
[05:15:48] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@07cbfb4]: Update mobileapps to a1fa41b
[05:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:51] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:17:59] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:19:06] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@07cbfb4]: Update mobileapps to a1fa41b (duration: 03m 18s)
[05:19:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:25] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 (duration: 01m 01s)
[05:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:37] <marostegui>	 !log Stop replication on dbstore1002 and db1103:3312 in sync
[05:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:11] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:30:16] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: Setup db1116 for backup generation on eqiad of s7 and s8 [puppet] - https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[05:30:51] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701
[05:32:09] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701 (owner: Marostegui)
[05:33:11] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701 (owner: Marostegui)
[05:35:21] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3312 (duration: 00m 56s)
[05:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:19] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701 (owner: Marostegui)
[05:43:22] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[05:53:01] <wikibugs>	 (PS4) Giuseppe Lavagetto: profile::lvs::realserver: new profile to substitute the role [puppet] - https://gerrit.wikimedia.org/r/463488
[05:53:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] profile::lvs::realserver: new profile to substitute the role [puppet] - https://gerrit.wikimedia.org/r/463488 (owner: Giuseppe Lavagetto)
[05:54:12] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:00:18] <wikibugs>	 (PS4) Giuseppe Lavagetto: parsoid: convert to role/profile [puppet] - https://gerrit.wikimedia.org/r/463489
[06:01:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] parsoid: convert to role/profile [puppet] - https://gerrit.wikimedia.org/r/463489 (owner: Giuseppe Lavagetto)
[06:14:17] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::lvs::realserver: no conftool in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/463704
[06:15:31] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] profile::lvs::realserver: no conftool in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/463704 (owner: Giuseppe Lavagetto)
[06:18:02] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:20:12] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
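The memcached error-rate alerts above report a percentage of data points over a threshold (CRITICAL at 40.00% or 60.00% above 5000.0, OK at less than 40.00% above 1000.0). A minimal sketch of that kind of evaluation follows. It is an illustration only: the function name `classify`, the WARNING and UNKNOWN states, and the exact comparison operators are assumptions, not the actual check used on graphite1001.

```python
def classify(datapoints, warn=1000.0, crit=5000.0, percent=40.0):
    """Classify a window of samples by the share of points over each threshold.

    warn, crit and percent mirror the numbers in the alert text; how the real
    check combines them is an assumption made for this sketch.
    """
    if not datapoints:
        return "UNKNOWN"  # nothing to evaluate (hypothetical state)
    total = len(datapoints)
    # Percentage of samples strictly above each threshold.
    over_crit = 100.0 * sum(1 for v in datapoints if v > crit) / total
    over_warn = 100.0 * sum(1 for v in datapoints if v > warn) / total
    if over_crit >= percent:
        return "CRITICAL"
    if over_warn >= percent:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Three of five samples exceed 5000.0, i.e. 60% of the window.
    print(classify([6000, 7000, 6000, 100, 200]))
```

Evaluating a share of the window rather than a single sample is what lets an alert of this kind clear a few minutes after a short spike.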
[06:32:43] <icinga-wm>	 PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/phaste]
[06:32:52] <icinga-wm>	 PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/pybal-check]
[06:33:49] <wikibugs>	 Operations: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 (Joe)
[06:33:53] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_established_connections]
[06:39:12] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [tilerator/deploy@22f90ee] (maps1004): Update tilerator to latest (T205462)
[06:39:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:19] <stashbot>	 T205462: Load OSM data into maps1004 - https://phabricator.wikimedia.org/T205462
[06:39:31] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [tilerator/deploy@22f90ee] (maps1004): Update tilerator to latest (T205462) (duration: 00m 19s)
[06:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:12] <icinga-wm>	 RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:12] <icinga-wm>	 RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:12] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:59:58] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [kartotherian/deploy@ab6cb74] (maps1004): Update kartotherian to latest (T205462)
[07:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:03] <stashbot>	 T205462: Load OSM data into maps1004 - https://phabricator.wikimedia.org/T205462
[07:00:15] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [kartotherian/deploy@ab6cb74] (maps1004): Update kartotherian to latest (T205462) (duration: 00m 16s)
[07:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:03] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 34 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:10:12] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:14:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:19:22] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 28 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:24:23] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 22 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:28:43] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 50 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:30:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:31:33] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:36:42] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:38:53] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:41:54] <wikibugs>	 Operations, Wikimedia-Mailing-lists, Patch-For-Review: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750 (valerio.bozzolan) Apparently yes.
[07:44:01] <wikibugs>	 (PS3) Giuseppe Lavagetto: parsoid: connect to MediaWiki via https everywhere [puppet] - https://gerrit.wikimedia.org/r/458475
[07:44:32] <yannf>	 https://phabricator.wikimedia.org/T205636 this is a very serious issue
[07:45:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] parsoid: connect to MediaWiki via https everywhere [puppet] - https://gerrit.wikimedia.org/r/458475 (owner: Giuseppe Lavagetto)
[07:52:20] <wikibugs>	 (PS2) Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907)
[07:52:22] <wikibugs>	 (PS1) Alexandros Kosiaris: scap: Replace an ugly hack with puppet 4 syntax [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907)
[07:52:24] <wikibugs>	 (PS1) Alexandros Kosiaris: WIP: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907)
[07:54:03] <icinga-wm>	 PROBLEM - confd service on wtp1045 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[07:54:26] <banyek>	 !log disabling puppet on labsdb1009, labsdb1010, labsdb1011
[07:54:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:51] <banyek>	 !log disabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983)
[07:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:55] <stashbot>	 T183983: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983
[07:55:16] <_joe_>	 the confd thing on wtp1045 is me
[07:55:41] <wikibugs>	 (PS1) Giuseppe Lavagetto: parsoid: fixup for the https uri [puppet] - https://gerrit.wikimedia.org/r/463710
[07:57:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] parsoid: fixup for the https uri [puppet] - https://gerrit.wikimedia.org/r/463710 (owner: Giuseppe Lavagetto)
[07:58:20] <wikibugs>	 (CR) Banyek: [C: 2] wikireplicas: Config template generation for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:10:42] <wikibugs>	 (PS30) Banyek: wikireplicas: Config template generation for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983)
[08:10:48] <wikibugs>	 (CR) Banyek: [V: 2 C: 2] wikireplicas: Config template generation for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:14:00] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 33 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:18:40] <icinga-wm>	 PROBLEM - Check systemd state on labsdb1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:19:00] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
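The RIPE Atlas alerts above all follow the pattern "failed N probes of 315 (alerts on 25)": the check was CRITICAL at 28, 33, 34, 46 and 50 failed probes and OK at 17, 18, 19 and 22. A small sketch of that comparison is given here. The function name `atlas_status` is hypothetical, and treating exactly 25 failures as OK is an assumption, since the log never shows that boundary case.

```python
def atlas_status(failed, total=315, alert_on=25):
    """Return 'CRITICAL' when more than alert_on probes failed, else 'OK'.

    Strictly-greater-than is an assumption; the log only shows values
    well above or below the limit of 25.
    """
    if failed < 0 or failed > total:
        raise ValueError("failed must be between 0 and total")
    return "CRITICAL" if failed > alert_on else "OK"


if __name__ == "__main__":
    # Values taken from the alerts in the log.
    for failed in (34, 18, 28, 22):
        print(failed, atlas_status(failed))
```

With the failure count hovering between 17 and 50, a fixed limit of 25 flips state repeatedly, which matches the flapping seen in the log.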
[08:22:46] <wikibugs>	 (PS2) Elukey: Add an-coord1001 to puppet [puppet] - https://gerrit.wikimedia.org/r/463297 (https://phabricator.wikimedia.org/T204970)
[08:24:19] <wikibugs>	 (CR) Elukey: [C: 2] Add an-coord1001 to puppet [puppet] - https://gerrit.wikimedia.org/r/463297 (https://phabricator.wikimedia.org/T204970) (owner: Elukey)
[08:28:44] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can be found in `/var...
[08:32:50] <godog>	 arturo: I am around now, what's up?
[08:39:59] <wikibugs>	 (PS1) ArielGlenn: dumps: set up a minimal config file for 'other' dumps [puppet] - https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825)
[08:40:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] dumps: set up a minimal config file for 'other' dumps [puppet] - https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825) (owner: ArielGlenn)
[08:40:53] <wikibugs>	 Operations, netops: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (Marostegui)
[08:41:18] <wikibugs>	 (PS1) Banyek: wikireplicas: quickfix for wmf-pt-kill template config [puppet] - https://gerrit.wikimedia.org/r/463712 (https://phabricator.wikimedia.org/T183983)
[08:42:25] <wikibugs>	 (CR) Marostegui: [C: 1] wikireplicas: quickfix for wmf-pt-kill template config [puppet] - https://gerrit.wikimedia.org/r/463712 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:42:52] <wikibugs>	 (CR) Banyek: [C: 2] wikireplicas: quickfix for wmf-pt-kill template config [puppet] - https://gerrit.wikimedia.org/r/463712 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:48:56] <icinga-wm>	 RECOVERY - Check systemd state on labsdb1010 is OK: OK - running: The system is fully operational
[08:50:21] <wikibugs>	 (PS2) ArielGlenn: dumps: set up a minimal config file for 'other' dumps [puppet] - https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825)
[08:54:04] <_joe_>	 !log rolling restart of parsoid in eqiad
[08:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:12] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['an-coord1001.eqiad.wmnet'] ```
[08:55:46] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (elukey) So attempting to install the OS leads to:  ``` iDRAC Settings: CBL0009: Backplane 1 connector A0 is not connected. CBL0009: Backplane 1 connector B0 is not c...
[08:56:38] <_joe_>	 !log rolling restart of parsoid in codfw; afterwards, parsoid will connect to the MediaWiki API via HTTPS
[08:56:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:46] <wikibugs>	 (PS1) Jcrespo: mariadb: Depool db1092, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392)
[09:09:38] <wikibugs>	 (PS2) Jcrespo: mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392)
[09:09:58] <wikibugs>	 (CR) Marostegui: "Do you mean db1092 (what the commit says) or db1093 (what the patch does)?" [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:10:11] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:10:49] <wikibugs>	 (CR) Marostegui: [C: -1] "Also depool db1093 from API" [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:15:15] <wikibugs>	 Operations, Recommendation-API, Research, SCB, Services (next): Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (fgiunchedi) >>! In T205452#4625109, @bmansurov wrote: > Friendly ping @Joe @fgiunchedi. Can you please help with this task? Thanks!  Looks like @jcre...
[09:15:51] <volans>	 !log Set Racktables in read-only mode - T199083
[09:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:57] <stashbot>	 T199083: Migrate the hardware inventory from Racktables to Netbox - https://phabricator.wikimedia.org/T199083
[09:18:58] <wikibugs>	 (PS3) Jcrespo: mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392)
[09:22:43] <wikibugs>	 Operations, Patch-For-Review: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (fgiunchedi) >>! In T205526#4624463, @MoritzMuehlenhoff wrote: > Can you please add the nickserv password to pwstore?  I've added the password to `private.git`, is it needed inside pwstore too in thi...
[09:25:26] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:33:12] <godog>	 !log test formatting sdh and sdi on ms-be2040 with crc=0 - T199198
[09:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:17] <stashbot>	 T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[09:33:34] <wikibugs>	 (PS9) Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - https://gerrit.wikimedia.org/r/460492
[09:40:57] <wikibugs>	 Operations, media-storage, Patch-For-Review, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (fgiunchedi) All filesystems on ms-be1040 have been reformatted with `crc=0` and almost all have finished filling up again with data. I ha...
[10:03:41] <arturo>	 godog: I was playing with the prometheus-openstack-exporter the other day
[10:03:48] <arturo>	 I plan to followup today
[10:04:07] <wikibugs>	 (PS37) Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885)
[10:04:09] <wikibugs>	 (PS8) Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885)
[10:04:40] <yannf>	 https://phabricator.wikimedia.org/T205636 this is a very serious issue
[10:04:47] <yannf>	 I know IE is evil, but...
[10:05:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[10:05:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch_cluster: adding new features [software/spicerack] - https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[10:07:01] <icinga-wm>	 RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[10:07:49] <akosiaris>	 yannf: https://phabricator.wikimedia.org/T205636#4626267 says will be fixed on Monday
[10:08:05] <akosiaris>	 so I am guessing when the deployment happens it will be resolved 
[10:10:07] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[10:10:38] <Amir1>	 !log mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --wiki=fawiki --delete (T201009)
[10:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:43] <stashbot>	 T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009
[10:12:22] <yannf>	 akosiaris, ok, I am pinging here because it is still in "Needs Triage"
[10:14:19] <godog>	 arturo: sweet! 
[10:15:26] <Amir1>	 !log ladsgroup@mwmaint2001:~$ mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --prefix on all CentralAuth wikis (T201009)
[10:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:38] <arturo>	 godog: I guess the monitoring host is not labmon. to my understanding, labmon is only for things in cloudvps (i.e., virtual), and these openstack metrics are more 'physical', right?
[10:20:34] <godog>	 arturo: eh good question, there is prometheus on labmon and IMHO since openstack-exporter is used only in wmcs metrics should live there
[10:20:58] <godog>	 in my mind the split is more along prod/wmcs lines rather than virtual/physical
[10:21:42] <godog>	 !log repair /dev/sdf1 /dev/sde1 on ms-be1041 - T199198
[10:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:47] <stashbot>	 T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[10:30:06] <jouncebot>	 jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1030).
[10:30:27] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546)
[10:36:55] <wikibugs>	 (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:38:02] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:41:26] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:463723|Bumping portals to master (T128546)]] (duration: 00m 59s)
[10:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:30] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:42:23] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:463723|Bumping portals to master (T128546)]] (duration: 00m 56s)
[10:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:41] <wikibugs>	 (CR) Mholloway: [C: 1] "Nice find. LGTM!" [puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: MSantos)
[10:50:53] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[10:51:01] <wikibugs>	 (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:51:10] <wikibugs>	 (CR) jenkins-bot: mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[10:51:16] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 483.80 seconds
[10:52:06] <jynus>	 commons issues?
[10:52:36] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1093, db1064 (duration: 00m 57s)
[10:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:07] <jynus>	 there is 2 slow queries 1-hour long on those hosts
[10:53:22] <volans>	 :(
[10:53:52] <jynus>	 I am going to depool the host
[10:54:19] <volans>	 is not the only slave delayed in s4
[10:55:36] <wikibugs>	 Operations, Research, SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (Miriam) p:Triage>High
[10:55:43] <jynus>	 not?
[10:55:46] <wikibugs>	 (PS1) Jcrespo: mariadb: Emergency depool db2058, issues? [mediawiki-config] - https://gerrit.wikimedia.org/r/463725
[10:55:51] <jynus>	 then there is another commons issue
[10:56:47] <jynus>	 yep, we have major commons issues
[10:56:49] <volans>	 I was looking at tendril
[10:57:21] <jynus>	 I am going to depool it anyway
[10:57:24] <volans>	 ack
[10:57:28] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Emergency depool db2058, issues? [mediawiki-config] - https://gerrit.wikimedia.org/r/463725 (owner: Jcrespo)
[10:58:15] <jynus>	 volans: mediawiki works badly when one host is lagged
[10:58:23] <volans>	 I know
[10:58:57] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2058 (duration: 00m 57s)
[10:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:02] <jynus>	 Check 'Logstash Error rate for mw2226.codfw.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.09, After: 2.00, Threshold: 1.00)
[11:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1100).
[11:00:04] <jouncebot>	 kart_ and Pl217: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:17] * kart_ is here
[11:00:46] <zeljkof>	 o/
[11:00:50] <zeljkof>	 I can SWAT today
[11:01:01] <jynus>	 stop SWAT
[11:01:04] <volans>	 zeljkof: we're having some issues in the database cluster for commons (s4)
[11:01:05] <jynus>	 we have a major outage
[11:01:27] <zeljkof>	 jynus, volans: ok, did not even start yet
[11:01:56] <zeljkof>	 jynus, volans: could things get back to normal during swat, or should I just cancel this window?
[11:02:25] <volans>	 zeljkof: too early to say at this moment, the issue started 10m ago
[11:02:38] <zeljkof>	 volans: ok, swat on hold
[11:02:51] <zeljkof>	 kart_: ^
[11:02:56] <volans>	 I'll keep you posted
[11:03:12] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db1097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.83 seconds
[11:03:14] <jynus>	 it is not the servers
[11:03:15] <kart_>	 zeljkof: yep. Ping me in case we are back to normal.
[11:03:16] <zeljkof>	 volans: thanks!
[11:03:20] <jynus>	 it is a commons maintenance
[11:03:21] <zeljkof>	 kart_: will do
[11:03:29] <jynus>	 which is creating the outage
[11:04:46] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.07 seconds
[11:05:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.54 seconds
[11:05:56] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.34 seconds
[11:06:05] <wikibugs>	 Operations, Gadgets, MediaWiki-Cache, Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Conn yields updated to today:  ``` mc2019.codfw.wmnet: STAT conn_yields 5650...
[11:06:08] <wikibugs>	 (CR) jenkins-bot: mariadb: Emergency depool db2058, issues? [mediawiki-config] - https://gerrit.wikimedia.org/r/463725 (owner: Jcrespo)
[11:06:34] <jynus>	 Amir1: that is you running the maintenance
[11:06:56] <_joe_>	 Amir1: please kill that script or I'll do it
[11:06:57] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:06:59] <wikibugs>	 (CR) Vgutierrez: [C: 2] Change how DNS update commands work to handle problematic values [software/certcentral] - https://gerrit.wikimedia.org/r/459662 (owner: Alex Monk)
[11:07:16] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:08:12] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db1097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:08:36] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.05 seconds
[11:09:17] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:09:54] <_joe_>	 !log killed bash runner.sh by user ladsgroup on mwmaint2001
[11:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:30] <Amir1>	 _joe_: thanks, I was afk for a bit
[11:10:43] <_joe_>	 Amir1: heh np, it was causing an outage
[11:10:44] <Amir1>	 Sorry for the trouble 
[11:10:53] <_joe_>	 btw, why didn't you use foreachwikiindblist ?
[11:11:28] <jynus>	 I will repool db2058
[11:11:34] <jynus>	 it was not the only one
[11:11:36] <Amir1>	 Because the group doesn't exist so I manually composed the list
[11:11:38] <volans>	 ack
[11:11:50] <wikibugs>	 (CR) Mholloway: [C: -1] "Oops, one issue inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: MSantos)
[11:11:51] <volans>	 (my ack was to jaime)
[11:12:04] <Amir1>	 tgr|away: fyi ^
[11:12:45] <jynus>	 Amir1: no danger for it to start automatically, right?
[11:12:55] <Amir1>	 No
[11:13:01] <jynus>	 so I can unblock the swat
[11:13:08] <Amir1>	 Yup
[11:13:13] <Amir1>	 Sorry again 
[11:13:20] <jynus>	 but wait for my deployment first, zeljkof
[11:13:49] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: Emergency depool db2058, issues?" [mediawiki-config] - https://gerrit.wikimedia.org/r/463728
[11:14:56] <wikibugs>	 (CR) Jcrespo: [C: 2] Revert "mariadb: Emergency depool db2058, issues?" [mediawiki-config] - https://gerrit.wikimedia.org/r/463728 (owner: Jcrespo)
[11:15:13] <wikibugs>	 (CR) Jcrespo: [V: 2 C: 2] Revert "mariadb: Emergency depool db2058, issues?" [mediawiki-config] - https://gerrit.wikimedia.org/r/463728 (owner: Jcrespo)
[11:16:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 552.46 seconds
[11:16:48] <zeljkof>	 jynus: sure, let me know when I can start swat
[11:16:58] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2058 (duration: 00m 55s)
[11:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:01] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 440.58 seconds
[11:17:02] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1110 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 441.13 seconds
[11:17:11] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 445.59 seconds
[11:17:21] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 452.27 seconds
[11:17:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 464.86 seconds
[11:17:42] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 466.61 seconds
[11:17:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 474.90 seconds
[11:17:51] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.97 seconds
[11:17:52] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s5 on db1113 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 490.54 seconds
[11:18:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1110 is OK: OK slave_sql_lag Replication lag: 59.86 seconds
[11:18:12] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.43 seconds
[11:18:22] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[11:18:51] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:18:52] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1100 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:18:52] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
[11:19:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Replication lag: 0.10 seconds
[11:19:01] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1113 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:19:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:20:35] <jynus>	 zeljkof: you should be ok now, but you will see lots of errors in the logs from the last hour
[11:20:55] <wikibugs>	 (CR) jenkins-bot: Revert "mariadb: Emergency depool db2058, issues?" [mediawiki-config] - https://gerrit.wikimedia.org/r/463728 (owner: Jcrespo)
[11:21:06] <jynus>	 e.g. scap may complain
[11:22:00] <zeljkof>	 jynus: ok, I'll try, if scap is not happy, I'll abort
[11:23:38] <kart_>	 zeljkof: you can go first with config patch, if that is quicker thing to do.
[11:26:31] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[11:27:12] <zeljkof>	 Pl217: around for SWAT?
[11:27:36] <kart_>	 zeljkof: I'm taking care of that patch..
[11:27:40] <zeljkof>	 kart_: looks like the other person is not around for SWAT, I'll review and merge your patch, it could take a while to merge :/
[11:27:50] <zeljkof>	 kart_: ah, you're in charge of both patches?
[11:28:05] <kart_>	 zeljkof: yep. updated page.
[11:28:12] <zeljkof>	 I'll merge both, and deploy the config first, since extension will take a while to merge
[11:28:19] <kart_>	 OK!
[11:29:54] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177)
[11:31:09] <zeljkof>	 kart_: there is no related task for 460492?
[11:31:36] <zeljkof>	 (just asking, it's not required)
[11:32:07] <zeljkof>	 but it's rare that there is no phab task, so checking if it's a mistake
[11:33:00] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177)
[11:33:53] <kart_>	 zeljkof: it is fine. It was followed up from: https://phabricator.wikimedia.org/T202286
[11:34:02] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:34:20] <wikibugs>	 (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/460492 (owner: Petar.petkovic)
[11:34:40] <wikibugs>	 (PS1) Banyek: wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - https://gerrit.wikimedia.org/r/463734
[11:34:41] <icinga-wm>	 PROBLEM - swift-account-server on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:34:46] <kart_>	 zeljkof: commit message has all details about it.
[11:34:51] <icinga-wm>	 PROBLEM - DPKG on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:34:51] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:34:51] <icinga-wm>	 PROBLEM - swift-object-server on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:34:51] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:01] <icinga-wm>	 PROBLEM - configured eth on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:01] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:02] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:11] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:12] <icinga-wm>	 PROBLEM - Disk space on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:21] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:21] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:22] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:22] <icinga-wm>	 PROBLEM - Check size of conntrack table on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:35:33] <zeljkof>	 kart_: ok
[11:35:39] <wikibugs>	 (Merged) jenkins-bot: Remove unused default source language config for CX [mediawiki-config] - https://gerrit.wikimedia.org/r/460492 (owner: Petar.petkovic)
[11:35:42] <icinga-wm>	 RECOVERY - swift-account-server on ms-be2040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[11:35:42] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[11:35:42] <icinga-wm>	 RECOVERY - DPKG on ms-be2040 is OK: All packages OK
[11:35:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2040 is OK: OK ferm input default policy is set
[11:35:51] <icinga-wm>	 RECOVERY - swift-object-server on ms-be2040 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[11:35:52] <kart_>	 Now, what are these errors? :/
[11:36:00] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177)
[11:36:01] <icinga-wm>	 RECOVERY - configured eth on ms-be2040 is OK: OK - interfaces up
[11:36:01] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[11:36:02] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[11:36:02] <wikibugs>	 (PS2) Banyek: wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983)
[11:36:02] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[11:36:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational
[11:36:11] <icinga-wm>	 RECOVERY - Disk space on ms-be2040 is OK: DISK OK
[11:36:18] <zeljkof>	 kart_: 460492 is at mwdebug2001
[11:36:21] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be2040 is OK: OK: nf_conntrack is 2 % full
[11:36:21] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[11:36:21] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[11:36:21] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[11:36:31] <kart_>	 zeljkof: testing..
[11:36:48] <wikibugs>	 (CR) jenkins-bot: Remove unused default source language config for CX [mediawiki-config] - https://gerrit.wikimedia.org/r/460492 (owner: Petar.petkovic)
[11:38:22] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) (owner: Bstorm)
[11:38:42] <kart_>	 zeljkof: looks fine.
[11:38:51] <zeljkof>	 kart_: ok, deploying
[11:39:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: 2] "Compiler happy: https://puppet-compiler.wmflabs.org/compiler1002/12686/" [puppet] - https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[11:41:07] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:460492|Remove unused default source language config for CX]] (duration: 00m 57s)
[11:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:14] <zeljkof>	 kart_: deployed
[11:41:18] <kart_>	 cool.
[11:41:34] <zeljkof>	 kart_: waiting for 463446 to merge...
[11:43:23] <kart_>	 Yeah. We can go and finish a 5K run till it gets merged ;)
[11:43:51] <zeljkof>	 yes, about that time :D
[11:52:22] <arturo>	 !log install prometheus-openstack-exporter 0.0.8-3 in reprepro T203177
[11:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:26] <stashbot>	 T203177: cloudvps: metrics and analytics  - https://phabricator.wikimedia.org/T203177
[11:52:30] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: d/dirs: include /var/cache/prometheus-openstack-exporter [debs/prometheus-openstack-exporter] - https://gerrit.wikimedia.org/r/463736 (https://phabricator.wikimedia.org/T203177)
[11:52:33] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: d/changelog: refresh for 0.0.8-3 [debs/prometheus-openstack-exporter] - https://gerrit.wikimedia.org/r/463737
[11:52:45] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: 2 C: 2] d/dirs: include /var/cache/prometheus-openstack-exporter [debs/prometheus-openstack-exporter] - https://gerrit.wikimedia.org/r/463736 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[11:53:02] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: 2 C: 2] d/changelog: refresh for 0.0.8-3 [debs/prometheus-openstack-exporter] - https://gerrit.wikimedia.org/r/463737 (owner: Arturo Borrero Gonzalez)
[11:56:51] <jynus>	 !log stopping db1093 to clone it to dbstore1001
[11:56:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:21] <kart_>	 zeljkof: merged?
[12:02:31] <zeljkof>	 kart_: yes, a few seconds ago
[12:02:56] <kart_>	 zeljkof: What are these post-merge tests?
[12:03:12] <zeljkof>	 kart_: I don't really know :D
[12:03:36] <zeljkof>	 ah, coverage report
[12:05:23] <zeljkof>	 kart_: the patch is at mwdebug2001
[12:06:19] <wikibugs>	 (CR) Mholloway: [C: -1] Fix: Regenerate map tiles up to zoom level 9 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: MSantos)
[12:07:01] <kart_>	 zeljkof: okay. testing.
[12:08:30] <wikibugs>	 (CR) Banyek: [C: 2] wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[12:09:38] <wikibugs>	 (PS3) Banyek: wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983)
[12:09:44] <wikibugs>	 (CR) Banyek: [V: 2 C: 2] wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[12:10:47] <kart_>	 zeljkof: looks OK. Go ahead.
[12:11:00] <zeljkof>	 kart_: ok, deploying
[12:11:09] <kart_>	 zeljkof: I needed to confirm, so I double-checked a few articles, which took more time..
[12:11:42] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - https://gerrit.wikimedia.org/r/463576 (owner: BryanDavis)
[12:12:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: 2] cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - https://gerrit.wikimedia.org/r/463576 (owner: BryanDavis)
[12:12:54] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/ContentTranslation/: SWAT: [[gerrit:463446|Fix error in CXTransclusionNode#afterRender method (T205521)]] (duration: 00m 59s)
[12:12:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:59] <stashbot>	 T205521: Working with references is broken in CX2 - https://phabricator.wikimedia.org/T205521
[12:13:13] <zeljkof>	 kart_: deployed! please check and thanks for deploying with #releng ;)
[12:13:18] <zeljkof>	 !log EU SWAT finished
[12:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:27] <kart_>	 zeljkof: cool. Thanks a lot!
[12:15:05] <banyek>	 !log enabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983)
[12:15:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:09] <stashbot>	 T183983: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983
[12:19:28] <banyek>	 !log stopping replication on s2@dbstore2002: the tables are being compressed
[12:19:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:36] <banyek>	 !log stopping replication on s2@dbstore2002: the tables are being compressed (T204930)
[12:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:40] <stashbot>	 T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930
[12:20:05] <wikibugs>	 Operations, Recommendation-API, Research, SCB, Services (next): Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (bmansurov) Thanks @fgiunchedi.  @mobrovac no blockers left?
[12:26:19] <banyek>	  converting enwiki.categorylinks to TokuDB on host dbstore1002 (T205544)
[12:26:20] <stashbot>	 T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544
[12:26:28] <banyek>	 !log  converting enwiki.categorylinks to TokuDB on host dbstore1002 (T205544)
[12:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:08] <wikibugs>	 (CR) Gehel: [C: -1] Fix: Regenerate map tiles up to zoom level 9 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: MSantos)
[12:34:21] <wikibugs>	 (PS1) Elukey: profile::analytics::refinery::job::data_purge: add two timers [puppet] - https://gerrit.wikimedia.org/r/463742 (https://phabricator.wikimedia.org/T172532)
[12:35:05] <wikibugs>	 (CR) Elukey: [C: 2] profile::analytics::refinery::job::data_purge: add two timers [puppet] - https://gerrit.wikimedia.org/r/463742 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[12:36:11] <wikibugs>	 (PS2) Elukey: yarn_http: Restrict to caches [puppet] - https://gerrit.wikimedia.org/r/463431 (owner: Muehlenhoff)
[12:38:23] <akosiaris>	 !log upload hfst_3.13.0~r3461-1+wmf2 to apt.wikimedia.org/jessie-wikimedia/main. T199962
[12:38:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:28] <stashbot>	 T199962: Apertium maintenance updates (July-September) - https://phabricator.wikimedia.org/T199962
[12:39:09] <wikibugs>	 (CR) Elukey: [C: 2] yarn_http: Restrict to caches [puppet] - https://gerrit.wikimedia.org/r/463431 (owner: Muehlenhoff)
[12:42:32] <wikibugs>	 (PS2) Elukey: Only allow HTTP port for Hue [puppet] - https://gerrit.wikimedia.org/r/463428 (owner: Muehlenhoff)
[12:43:34] <wikibugs>	 (CR) Elukey: [C: 2] Only allow HTTP port for Hue [puppet] - https://gerrit.wikimedia.org/r/463428 (owner: Muehlenhoff)
[12:44:18] <kart_>	 akosiaris: thanks!
[12:46:22] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, User-fgiunchedi, User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (fgiunchedi)
[12:46:24] <wikibugs>	 Operations, Wikimedia-Logstash, Goal: Logstash/Kibana architecture review - https://phabricator.wikimedia.org/T198754 (fgiunchedi) Open>Resolved a:fgiunchedi The architecture and gaps review has been carried out as part of the [[ https://docs.google.com/document/d/1Aq-Dhq3SbRCPmQdw6jjaHvH...
[12:47:11] <wikibugs>	 (PS3) Elukey: Filter out duplicates in allowed hosts [puppet] - https://gerrit.wikimedia.org/r/463253 (owner: Muehlenhoff)
[12:47:58] <wikibugs>	 (CR) Elukey: [C: 2] Filter out duplicates in allowed hosts [puppet] - https://gerrit.wikimedia.org/r/463253 (owner: Muehlenhoff)
[12:48:20] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, User-fgiunchedi, User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (fgiunchedi)
[12:48:26] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (fgiunchedi) Open>Resolved a:fgiunchedi The audit/list of current and future logs produce...
[12:49:07] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, User-fgiunchedi, User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (fgiunchedi)
[12:49:09] <wikibugs>	 Operations, Wikimedia-Logstash, Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (fgiunchedi) Open>Resolved a:fgiunchedi The architecture and gaps review has been carried out as part of the [[ https://docs.google.com/doc...
[12:50:06] <wikibugs>	 (PS4) MSantos: Fix: Regenerate map tiles up to zoom level 9 [puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201)
[12:51:25] <wikibugs>	 (CR) MSantos: "Agreed. I was a bit confused on that." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: MSantos)
[12:53:22] <wikibugs>	 Operations, Wikimedia-Logstash: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (fgiunchedi)
[12:56:23] <wikibugs>	 Operations, Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (fgiunchedi)
[12:58:25] <wikibugs>	 Operations, Wikimedia-Logstash: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (fgiunchedi)
[13:00:24] <wikibugs>	 (PS1) KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/463745 (https://phabricator.wikimedia.org/T199447)
[13:00:39] <wikibugs>	 (PS4) Giuseppe Lavagetto: mediawiki::web::vhost: generalize the upload_rewrite mechanism. [puppet] - https://gerrit.wikimedia.org/r/463217
[13:02:06] <wikibugs>	 Operations, Wikimedia-Logstash: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (fgiunchedi)
[13:06:04] <wikibugs>	 Operations, Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (fgiunchedi)
[13:06:26] <wikibugs>	 (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12688/" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/463217 (owner: Giuseppe Lavagetto)
[13:06:58] <wikibugs>	 (PS1) Volans: sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079)
[13:07:23] <wikibugs>	 Operations, Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (fgiunchedi)
[13:07:54] <wikibugs>	 Operations, Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (fgiunchedi)
[13:07:59] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (fgiunchedi)
[13:08:10] <wikibugs>	 (CR) KartikMistry: [C: -1] "Only to be deployed with schedule plan and testing in Beta or Labs(?)." [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/463745 (https://phabricator.wikimedia.org/T199447) (owner: KartikMistry)
[13:09:13] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::vhost: generalize the upload_rewrite mechanism. [puppet] - https://gerrit.wikimedia.org/r/463217 (owner: Giuseppe Lavagetto)
[13:10:06] <wikibugs>	 Operations, Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (fgiunchedi)
[13:11:06] <wikibugs>	 Operations, Wikimedia-Logstash: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (fgiunchedi)
[13:11:08] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, User-fgiunchedi, User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (fgiunchedi)
[13:11:10] <wikibugs>	 (PS1) Volans: ircecho: log exception on exit [puppet] - https://gerrit.wikimedia.org/r/463749 (https://phabricator.wikimedia.org/T205522)
[13:11:12] <wikibugs>	 Operations, Wikimedia-Logstash, User-fgiunchedi, User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (fgiunchedi)
[13:12:12] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on wtp2011 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops
[13:12:29] <wikibugs>	 Operations, Wikimedia-Logstash, Goal: Logstash/Kibana architecture review - https://phabricator.wikimedia.org/T198754 (fgiunchedi)
[13:12:52] <Amir1>	 _joe_: jynus: FYI https://gerrit.wikimedia.org/r/c/mediawiki/core/+/463748 I didn't notice this doesn't exist in this maintenance script. I didn't write it, if I did, I would've added it
[13:12:55] <Amir1>	 sorry again
[13:14:06] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (fgiunchedi)
[13:15:09] <wikibugs>	 (PS1) Jcrespo: mariadb: Add s6 instance to dbstore1001 for backup generation [puppet] - https://gerrit.wikimedia.org/r/463751 (https://phabricator.wikimedia.org/T201392)
[13:16:15] <wikibugs>	 Operations, Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (fgiunchedi)
[13:16:19] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (fgiunchedi)
[13:16:24] <wikibugs>	 Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (fgiunchedi)
[13:16:34] <wikibugs>	 Operations, Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (fgiunchedi)
[13:16:36] <wikibugs>	 Operations, Mail, Wikimedia-Logstash, User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (fgiunchedi)
[13:16:37] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Add s6 instance to dbstore1001 for backup generation [puppet] - https://gerrit.wikimedia.org/r/463751 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[13:16:40] <wikibugs>	 Operations, Wikimedia-Logstash, Goal, Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (fgiunchedi)
[13:17:32] <jynus>	 Amir1: you don't need to say sorry
[13:19:32] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3149.90 seconds
[13:20:08] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] Add elasticsearch_cluster module (039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe)
[13:20:40] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10herron)
[13:20:43] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10herron) 05Open>03Resolved a:03herron
[13:22:53] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968)
[13:23:31] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This task and its related go...
[13:23:31] <wikibugs>	 10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10faidon) I'm not familair with openbugbounty.org, but looking [[ https://www.openbugbounty.org/bugbounty/create/ | at their website...
[13:24:02] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:24:30] <wikibugs>	 (03CR) 10Mholloway: [C: 031] Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos)
[13:25:51] <icinga-wm>	 RECOVERY - Filesystem available is greater than filesystem size on ms-be1041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops
[13:28:17] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968)
[13:28:43] <jynus>	 ^ banyek dbstore1002 alerting again
[13:30:31] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:30:45] <marostegui>	 Probably disabling lag for 24h is a good idea while doing heavy IO operations with that host, not the first time we see that :(
[13:31:41] <wikibugs>	 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) One more datapoint. While a memcached icinga error was still ongoing I ran m...
[13:32:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12690/mw1261.eqiad.wmnet/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[13:33:24] <wikibugs>	 (03CR) 10Gehel: "Looking good! I'll do a last pass when the parent CR is merged." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe)
[13:34:42] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:34:56] <jynus>	 marostegui: tokudb alters are not online
[13:35:11] <jynus>	 they actually block replication
[13:35:58] <marostegui>	 jynus: More reasons to downtime then :)
[13:36:57] * addshore reads up
[13:40:12] <icinga-wm>	 PROBLEM - mysqld processes on dbstore1001 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld
[13:40:45] <jynus>	 ^this is me, but it was supposed to have notifications disabled
[13:41:03] <jynus>	 I guess icinga weird puppet race condition issues
[13:41:22] <wikibugs>	 10Operations, 10monitoring: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi)
[13:42:05] <wikibugs>	 (03PS73) 10Gehel: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson)
[13:42:27] <jynus>	 rerunning puppet again and see if icinga setup kicks-in
[13:42:35] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10fgiunchedi)
[13:42:41] <wikibugs>	 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196 (10fgiunchedi)
[13:42:43] <wikibugs>	 10Operations, 10monitoring: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi)
[13:42:45] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10fgiunchedi)
[13:43:31] <icinga-wm>	 PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:43:32] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:44:01] <jynus>	 memcache errors are still up and stable since 13:20
[13:46:08] <jynus>	 mw2204
[13:46:52] <elukey>	 up and stable?
[13:47:26] <jynus>	 I am wrong
[13:47:29] <jynus>	 it is not a single server
[13:47:35] <jynus>	 but happens in bursts
[13:47:37] <elukey>	 seems more intermittent, it seems the usual gadget-definition issue :(
[13:47:50] <elukey>	 T203786
[13:47:51] <jynus>	 I don't see a pattern
[13:47:51] <stashbot>	 T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786
[13:49:02] <jynus>	 elukey: however, the errors seem quite stable: https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen&from=now-24h&to=now
[13:49:29] <jynus>	 and now they are gone
[13:50:18] <jynus>	 is there a memcache logs dashboard?
[13:50:57] <elukey>	 https://logstash.wikimedia.org/app/kibana#/dashboard/memcached
[13:51:01] <wikibugs>	 (03CR) 10Gehel: [C: 032] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson)
[13:51:10] <jynus>	 elukey: thanks!
[13:51:14] <elukey>	 yes but if you see the same sustained pattern happened before
[13:51:27] <elukey>	 I am pretty sure it is this annoying issue with gadget-definition
[13:51:58] <banyek>	 !log Downtimed the slave lag monitoring on dbstore1002 while the tables getting converted (T205544)
[13:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:03] <stashbot>	 T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544
[13:52:11] <jynus>	 banyek: thanks!
[13:52:38] <wikibugs>	 10Operations, 10Operations-Software-Development, 10Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans)
[13:52:47] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[13:52:55] <banyek>	 it was downtimed when I started the conversion; it might have expired (marvin-bot iirc)
[13:54:02] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760
[13:54:10] <wikibugs>	 10Operations, 10Operations-Software-Development, 10Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (10Volans)
[13:55:18] <wikibugs>	 (03CR) 10Elukey: [C: 031] "didn't spot anything weird from pcc, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[13:56:16] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[13:56:25] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968)
[13:57:33] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: default instance config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/463761 (https://phabricator.wikimedia.org/T198351)
[13:58:48] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: default instance config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/463761 (https://phabricator.wikimedia.org/T198351) (owner: 10Gehel)
[13:59:10] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: default instance config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/463761 (https://phabricator.wikimedia.org/T198351)
[13:59:58] <wikibugs>	 (03PS1) 10Elukey: Add matomo1001 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/463762 (https://phabricator.wikimedia.org/T202962)
[14:00:00] <elukey>	 bstorm_: o/ - can you ping me when you are online?
[14:00:18] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam)
[14:00:37] <wikibugs>	 (03PS2) 10Elukey: Add matomo1001 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/463762 (https://phabricator.wikimedia.org/T202962)
[14:00:52] <bstorm_>	 I'm online (in EST right now)
[14:00:58] <bstorm_>	 elukey: what's up?
[14:01:19] <elukey>	 ah nice! I wanted to ask you an opinion about the systemd timers types
[14:01:30] <bstorm_>	 Sure
[14:01:30] <elukey>	 we can do in pvt so I will not spam in here :)
[14:01:37] <bstorm_>	 Yup :)
[14:04:56] <jynus>	 banyek: I am not worried, but just in case, I cannot see downtime or ack or disabling of alerts on dbstore1002
[14:05:00] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron)
[14:05:19] <jynus>	 pinging just in case it was an icinga issue (sometimes it happens)
[14:06:09] <jynus>	 godog: the "Logstash rate of ingestion percent change compared to yesterday" I am going to guess was the small incident we had a few hours ago, which sends lots of errors per second
[14:06:38] <wikibugs>	 (03PS1) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763
[14:07:07] <banyek>	 jynus: it was an icinga issue as it was too slow when I hit the button for setting downtime, and I navigated away from page before i submitted the downtime (it was my bad tbh.)
[14:07:40] <jynus>	 banyek: I really mean it, sometimes icinga doesn't do downtimes
[14:07:44] <jynus>	 and requires a restart
[14:08:18] <wikibugs>	 (03PS2) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763
[14:09:10] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk)
[14:09:27] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 161.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[14:10:07] <wikibugs>	 (03PS1) 10Anomie: Set wgMultiContentRevisionSchemaMigrationStage read-new on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308)
[14:10:36] <wikibugs>	 (03PS3) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763
[14:10:38] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add matomo1001 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/463762 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey)
[14:10:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk)
[14:11:02] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) Hi @chelsyx, could you please clarify what is meant by `stats` role account?  I don't see a group...
[14:11:17] <wikibugs>	 (03CR) 10Anomie: [C: 032] "Deploy config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie)
[14:12:32] <wikibugs>	 (03Merged) 10jenkins-bot: Set wgMultiContentRevisionSchemaMigrationStage read-new on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie)
[14:12:56] <wikibugs>	 (03PS7) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581
[14:13:09] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron)
[14:13:31] <wikibugs>	 (03CR) 10Alex Monk: [C: 032] "rebased already-approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk)
[14:14:00] <logmsgbot>	 !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting MCR migration stage to write-both/read-new on mediawikiwiki (T198308) (duration: 00m 56s)
[14:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:06] <stashbot>	 T198308: Enable MCR migration stage "write both, read new" on live systems - https://phabricator.wikimedia.org/T198308
[14:15:18] <wikibugs>	 (03Merged) 10jenkins-bot: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk)
[14:15:43] <wikibugs>	 (03PS4) 10Alex Monk: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662
[14:15:50] <wikibugs>	 (03CR) 10Alex Monk: [C: 032] "rebase already-approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk)
[14:16:10] <wikibugs>	 (03CR) 10Alex Monk: [C: 04-1] "needs tests and rebase" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk)
[14:16:37] <icinga-wm>	 PROBLEM - Check systemd state on matomo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:16:40] <godog>	 jynus: yes that's likely it
[14:16:49] <wikibugs>	 (03PS1) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768
[14:16:58] <wikibugs>	 (03CR) 10jenkins-bot: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk)
[14:17:01] <wikibugs>	 (03PS2) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768
[14:17:05] <wikibugs>	 (03PS4) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763
[14:17:15] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam)
[14:17:45] <elukey>	 matomo1001 is failing due to me, first puppet run, apache + php
[14:17:47] <elukey>	 :P
[14:17:55] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763 (owner: 10ArielGlenn)
[14:18:13] <wikibugs>	 (03Merged) 10jenkins-bot: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk)
[14:18:44] <addshore>	 any idea what the difference in specs between mwmaint1001 and mwmaint2001 are?
[14:18:47] <icinga-wm>	 RECOVERY - Check systemd state on matomo1001 is OK: OK - running: The system is fully operational
[14:19:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 031] "I double-checked what got installed on labweb1001 (which is Stretch) and it looks like Newton." [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm)
[14:19:43] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10herron)
[14:19:51] <wikibugs>	 (03CR) 10jenkins-bot: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk)
[14:20:07] <icinga-wm>	 PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[matomo]
[14:21:40] <akosiaris>	 addshore: CPU wise ? 24 vs 40 cpus if that is of any help. memory wise they are the same. https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=mwmaint2001&var-datasource=codfw%20prometheus%2Fops vs https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=mwmaint1001&var-datasource=eqiad%20prometheus%2Fops
[14:21:52] <addshore>	 thanks!
[14:22:08] <addshore>	 trying to figure out why wikidata dispatching got 5 / 10 or something times faster with the DC switchover
[14:22:14] <addshore>	 related ticket is onId > 0 && $entityRevision =
[14:22:23] <addshore>	 .. no its not... it is https://phabricator.wikimedia.org/T205865
[14:23:47] <wikibugs>	 (03CR) 10jenkins-bot: Set wgMultiContentRevisionSchemaMigrationStage read-new on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie)
[14:23:49] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam)
[14:23:52] <akosiaris>	 hmm, interesting
[14:25:04] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] Add elasticsearch_cluster module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe)
[14:25:16] <icinga-wm>	 RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:25:26] <wikibugs>	 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi)
[14:26:42] <wikibugs>	 (03CR) 10Alex Monk: [C: 04-1] "per discussion on IRC, should document that while in this state the certificate will have the wrong domains, but it a) keeps the existing " [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk)
[14:29:19] <wikibugs>	 (03CR) 10Mathew.onipe: Add elasticsearch_cluster module (039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe)
[14:30:50] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Isaac)
[14:31:57] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Isaac)
[14:33:53] <wikibugs>	 10Operations, 10Traffic: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema) Apache Traffic Server v8.0.0 [[http://trafficserver.apache.org/downloads#8.0.0  | was released ]] on September 25th, 2018. Debian packaging work [[https://salsa.debian.org/debian/trafficserver/merge_request...
[14:34:56] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760
[14:35:43] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760
[14:37:10] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui)
[14:37:22] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) p:05Triage>03Normal
[14:37:34] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on db2058 is CRITICAL: cluster=mysql device=cciss,0 instance=db2058:9100 job=node site=codfw Marostegui T205872 - The acknowledgement expires at: 2018-10-04 14:34:14. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2058&var-datasource=codfw%2520prometheus%252Fops
[14:37:42] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10fgiunchedi)
[14:38:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760 (owner: 10Alexandros Kosiaris)
[14:39:33] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi)
[14:40:32] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968)
[14:42:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] ircecho: log exception on exit [puppet] - 10https://gerrit.wikimedia.org/r/463749 (https://phabricator.wikimedia.org/T205522) (owner: 10Volans)
[14:43:01] <volans>	 thanks godog! :)
[14:43:12] <wikibugs>	 (03PS4) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254)
[14:43:20] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:43:20] <icinga-wm>	 PROBLEM - swift-object-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:43:20] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:43:22] <godog>	 volans: np! thanks to you for taking the time
[14:43:30] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:43:31] <godog>	 ms-be1041 is me, silence expired
[14:43:37] <godog>	 fixing
[14:43:40] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:45:19] <wikibugs>	 (03CR) 10Bstorm: [C: 032] openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm)
[14:48:49] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463776
[14:49:14] <tgr>	 Amir1: apparently 1617150 passwords got prefixed on commons which is a bit scary
[14:49:33] <tgr>	 we do know from other wikis that it does not affect real users, right?
[14:51:06] <wikibugs>	 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) 05Open>03Resolved I checked again the server logs, everything looks good. Resolving this task.
[14:53:42] <icinga-wm>	 PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:56:27] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10herron)
[14:56:30] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10herron)
[14:58:16] <wikibugs>	 (03PS1) 10Bstorm: Revert "openstack: add case for stretch and newton in client repos" [puppet] - 10https://gerrit.wikimedia.org/r/463779
[14:58:32] <mutante>	  /away
[14:59:53] <wikibugs>	 (03PS3) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768
[15:00:09] <akosiaris>	 !log upgrade etherpad to 1.7.0-2 
[15:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:21] <wikibugs>	 (03CR) 10Bstorm: [C: 032] Revert "openstack: add case for stretch and newton in client repos" [puppet] - 10https://gerrit.wikimedia.org/r/463779 (owner: 10Bstorm)
[15:00:44] <wikibugs>	 (03PS4) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768
[15:01:27] <wikibugs>	 (03CR) 10Elukey: [C: 032] Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768 (owner: 10Elukey)
[15:03:29] <chasemp>	 Nikerabbit: are you about?
[15:03:57] <wikibugs>	 (03CR) 10Anomie: wiki replicas: Remove most comment joins from non-compat tables (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm)
[15:04:18] <wikibugs>	 (03PS38) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885)
[15:04:20] <icinga-wm>	 PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2046 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[15:04:20] <wikibugs>	 (03PS9) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885)
[15:05:11] <icinga-wm>	 PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:05:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe)
[15:05:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe)
[15:06:02] <Nikerabbit>	 chasemp: yeah
[15:06:35] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968)
[15:09:29] <icinga-wm>	 RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:10:45] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968)
[15:12:21] <wikibugs>	 (03PS1) 10Elukey: profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463781 (https://phabricator.wikimedia.org/T202962)
[15:15:28] <wikibugs>	 (03PS1) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254)
[15:16:04] <jynus>	 !log stopping db1064 to clone it to dbstore1001
[15:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:30] <wikibugs>	 (03Abandoned) 10Elukey: profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463781 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey)
[15:19:00] <Amir1>	 tgr: yup, I already did it on Persian Wikipedia, a rather medium wiki
[15:19:11] <icinga-wm>	 RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[15:19:20] <Amir1>	 and gave them a heads up to contact me if anything goes wrong
[15:19:31] <Amir1>	 and today deleted those
[15:19:41] <wikibugs>	 (03PS1) 10Elukey: profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463783 (https://phabricator.wikimedia.org/T202962)
[15:19:48] <Amir1>	 it's already deleted on mediawikiwiki 
[15:20:27] <wikibugs>	 (03PS2) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254)
[15:21:54] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463783 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey)
[15:22:49] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Add x1 to dbstore1001 as a backup source [puppet] - 10https://gerrit.wikimedia.org/r/463785 (https://phabricator.wikimedia.org/T201392)
[15:23:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "Not until all servers are back in a good state (inc. repl lag)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463776 (owner: 10Jcrespo)
[15:27:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12696/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto)
[15:28:00] <icinga-wm>	 PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[mariadb]
[15:28:05] <elukey>	 this is me again --^
[15:28:08] <wikibugs>	 10Operations, 10MediaWiki-Maintenance-scripts, 10Core Platform Team Kanban (Watching / External): sql enwik gives a poor error message when db doesn't exist - https://phabricator.wikimedia.org/T199008 (10CCicalese_WMF)
[15:28:20] <icinga-wm>	 PROBLEM - Check systemd state on matomo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:28:47] <wikibugs>	 Operations, monitoring, Core Platform Team Kanban (Watching / External), Wikimedia-Incident: Add alerts for Logstash rates in production - https://phabricator.wikimedia.org/T199479 (CCicalese_WMF)
[15:29:34] <wikibugs>	 Operations, WMF-JobQueue, Core Platform Team Kanban (Watching / External), User-ArielGlenn: Use PHP7 for web requests on jobrunner servers - https://phabricator.wikimedia.org/T195392 (CCicalese_WMF)
[15:30:35] <wikibugs>	 (PS2) Jcrespo: mariadb: Add x1 to dbstore1001 as a backup source [puppet] - https://gerrit.wikimedia.org/r/463785 (https://phabricator.wikimedia.org/T201392)
[15:30:37] <wikibugs>	 (PS3) Jcrespo: mariadb: Setup db1116 for backup generation on eqiad of s7 and s8 [puppet] - https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392)
[15:30:39] <wikibugs>	 (PS1) Jcrespo: mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392)
[15:35:41] <icinga-wm>	 RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:39:18] <wikibugs>	 (PS5) Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - https://gerrit.wikimedia.org/r/459785
[15:42:09] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:42:49] <wikibugs>	 (CR) Vgutierrez: api: Also handle SIGHUP signals to the API process (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/459785 (owner: Alex Monk)
[15:46:20] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational
[15:46:29] <icinga-wm>	 RECOVERY - swift-object-server on ms-be1041 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[15:46:29] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[15:46:29] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:46:30] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[15:46:49] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be1041 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[15:47:59] <icinga-wm>	 RECOVERY - Check systemd state on matomo1001 is OK: OK - running: The system is fully operational
[15:53:28] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Add x1 to dbstore1001 as a backup source [puppet] - https://gerrit.wikimedia.org/r/463785 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[15:55:37] <wikibugs>	 (CR) Vgutierrez: api: Also handle SIGHUP signals to the API process (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/459785 (owner: Alex Monk)
[15:57:23] <wikibugs>	 Operations, Operations-Software-Development, Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (Volans) p:Triage>Normal
[15:59:12] <wikibugs>	 (PS1) Andrew Bogott: Nova: reduce number of worker nodes [puppet] - https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889)
[16:01:19] <wikibugs>	 (PS2) Andrew Bogott: Nova: reduce number of worker nodes [puppet] - https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889)
[16:01:22] <wikibugs>	 (CR) Bstorm: "That should cut back nicely.  It makes me really want to know exactly what "too few" looks like.  I guess if api calls start getting hung " [puppet] - https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) (owner: Andrew Bogott)
[16:02:40] <wikibugs>	 (CR) Bstorm: [C: 1] Nova: reduce number of worker nodes [puppet] - https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) (owner: Andrew Bogott)
[16:04:07] <wikibugs>	 (CR) Andrew Bogott: [C: 2] Nova: reduce number of worker nodes [puppet] - https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) (owner: Andrew Bogott)
[16:05:44] <wikibugs>	 (CR) Andrew Bogott: [C: 1] "This is messy but it's my mess" [puppet] - https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) (owner: Bstorm)
[16:07:29] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvps: allow queries to the nova API by novaobserver [puppet] - https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177)
[16:09:26] <Krenair>	 arturo, admin or novaobserver?
[16:09:29] <Krenair>	 not sure this makes sense
[16:09:37] <Krenair>	 should just be opened to all if you're allowing novaobserver in
[16:11:10] <arturo>	 Krenair: well, novaobserver is an authenticated user. I think that's different than having the API fully public?
[16:11:19] <Krenair>	 no
[16:11:23] <Krenair>	 it's a guest user
[16:11:28] <Krenair>	 the password is public knowledge
[16:11:58] <Krenair>	 the API doesn't allow unauthenticated requests AFAIK so we added that as a read-only user
[16:13:41] <icinga-wm>	 RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[16:15:24] <andrewbogott>	 Yeah, I think that "" is fine policy for those things unless there's some UI reason why we don't want the values displaying in Horizon
[16:16:12] <andrewbogott>	 Although to be honest I don't understand what those policies (without verbs) really control
[16:17:22] <andrewbogott>	 Are those definitely policies that just control 'list' or do they provide access to a whole suite of commands?
[16:17:55] <wikibugs>	 Operations, SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (Dzahn)
[16:18:31] <wikibugs>	 Operations, fundraising-tech-ops, netops: deploy PFW policy commit 99eb6f026 - https://phabricator.wikimedia.org/T205888 (Jgreen) p:Triage>Normal
[16:19:11] <wikibugs>	 Operations, SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (Aklapper) Maybe this is about https://phabricator.wikimedia.org/S4 ? Needs clarification from @Mathew.onipe.
[16:21:43] <wikibugs>	 Operations, SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (Gehel) Yes, I can confirm, this is about S4. @Mathew.onipe will work in particular on replacing our elasticsearch servers, so he should be able to follow when those will be or...
[16:23:24] <wikibugs>	 Operations, Core Platform Team Kanban, Epic, Performance-Team (Radar), Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213 (CCicalese_WMF)
[16:23:34] <wikibugs>	 Operations, Core Platform Team Kanban, Epic, Performance-Team (Radar), Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206 (CCicalese_WMF)
[16:24:12] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudvps: unprotect some nova API queries [puppet] - https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177)
[16:26:33] <logmsgbot>	 !log ppchelko@deploy1001 Started restart [cpjobqueue/deploy@58f9ed3]: Fix KafkaConsumer not connected error
[16:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:21] <icinga-wm>	 RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[16:31:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[16:35:45] <wikibugs>	 (PS1) Paladox: ircecho: Convert script to python3 [puppet] - https://gerrit.wikimedia.org/r/463794
[16:46:05] <wikibugs>	 (PS2) Paladox: ircecho: Convert script to python3 [puppet] - https://gerrit.wikimedia.org/r/463794
[16:46:34] <wikibugs>	 Operations, Core Platform Team Kanban (Watching / External), HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (CCicalese_WMF)
[16:46:58] <wikibugs>	 (CR) Paladox: "I tested locally and works in python3 :)" [puppet] - https://gerrit.wikimedia.org/r/463794 (owner: Paladox)
[16:47:51] <paladox>	 godog volans i managed to convert ircecho to python3 here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463794/ (tested locally and works)
[16:47:55] <paladox>	 it may fix https://phabricator.wikimedia.org/T205522
[16:50:48] <godog>	 paladox: thanks! looks simple enough, did you have a chance to test it with garbage data like the problem in T205522 ?
[16:50:49] <stashbot>	 T205522: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522
[16:50:59] <godog>	 it feels like we're putting lipstick on a pig tho
[16:51:09] <paladox>	 godog haven't tested it with garbage data.
[16:51:29] <wikibugs>	 Operations, PoolCounter, monitoring, Core Platform Team Kanban (Watching / External), Wikimedia-Incident: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (CCicalese_WMF)
[16:51:31] <Krenair>	 lol
[16:52:53] <godog>	 paladox: ack, thanks!
[16:55:49] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: prometheus-openstack-exporter: ensure permissions [puppet] - https://gerrit.wikimedia.org/r/463795 (https://phabricator.wikimedia.org/T203177)
[16:56:06] <paladox>	 godog garbage stuff like https://wm-bot.wmflabs.org/logs/%23wikidata/20180926.txt ?
[16:57:00] <godog>	 paladox: perhaps, anything that triggers UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 92: invalid start byte
[16:57:06] <paladox>	 ok
[16:57:10] <godog>	 see also https://phabricator.wikimedia.org/T205522#4619588
[16:57:13] <godog>	 gotta go
[16:58:18] <wikibugs>	 (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - https://gerrit.wikimedia.org/r/463776 (owner: Jcrespo)
[16:58:34] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: 2] prometheus-openstack-exporter: ensure permissions [puppet] - https://gerrit.wikimedia.org/r/463795 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[16:59:23] <wikibugs>	 Operations, Operations-Software-Development, Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (Volans) a:Volans
[16:59:30] <wikibugs>	 (Merged) jenkins-bot: Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - https://gerrit.wikimedia.org/r/463776 (owner: Jcrespo)
[17:00:04] <jouncebot>	 gehel: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1700).
[17:00:18] <gehel>	 jouncebot: o/
[17:00:34] <gehel>	 onimisionipe: ^^ let's do this together
[17:00:51] <onimisionipe>	 gehel: alright!. bring it on!
[17:01:30] <wikibugs>	 Operations, Puppet, User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (Banyek) The change is out, the debian package works as expected, but the template generation had a typo which was fixed in https://gerrit.wikimedia.org/r/#/c/operati...
[17:01:51] <wikibugs>	 Operations: Netbox: upgrade to the latest version (>= 2.4) - https://phabricator.wikimedia.org/T205896 (Volans) p:Triage>Normal
[17:02:06] <wikibugs>	 Operations, Operations-Software-Development, Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (Volans) p:Triage>Normal a:Volans
[17:02:40] <jynus>	 !log stopping some mariadb instances on dbstore1001 and starting compression T201392
[17:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:44] <stashbot>	 T201392: Finish eqiad metadata database backup setup (s1-s8, x1) - https://phabricator.wikimedia.org/T201392
[17:03:02] <wikibugs>	 Operations, Electron-PDFs, Proton, Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Jdrewniak)
[17:03:04] <wikibugs>	 Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (Volans) p:Triage>Normal
[17:03:49] <wikibugs>	 Operations: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (Volans) p:Triage>Normal
[17:05:37] <wikibugs>	 Operations, Operations-Software-Development: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (Volans) p:Triage>Normal
[17:05:49] <wikibugs>	 Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (Volans) a:Volans>None
[17:05:59] <wikibugs>	 Operations: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (Volans) a:Volans>None
[17:06:29] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1093, db1064 (duration: 00m 57s)
[17:06:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:52] <wikibugs>	 (CR) jenkins-bot: Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - https://gerrit.wikimedia.org/r/463776 (owner: Jcrespo)
[17:09:10] <wikibugs>	 Operations, Operations-Software-Development: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900 (Volans) p:Triage>Normal
[17:09:22] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 59.22 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:11:32] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 72.87 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:11:57] <wikibugs>	 (PS1) Elukey: profile::piwik::webserver: install php7.0-mbstring on Stretch [puppet] - https://gerrit.wikimedia.org/r/463797 (https://phabricator.wikimedia.org/T202962)
[17:13:07] <wikibugs>	 (CR) Elukey: [C: 2] profile::piwik::webserver: install php7.0-mbstring on Stretch [puppet] - https://gerrit.wikimedia.org/r/463797 (https://phabricator.wikimedia.org/T202962) (owner: Elukey)
[17:17:20] <wikibugs>	 (PS6) Jdlrobson: Introduce cumin::selector dummy class [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[17:17:42] <wikibugs>	 (PS3) Paladox: ircecho: Convert script to python3 [puppet] - https://gerrit.wikimedia.org/r/463794
[17:26:31] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.95 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:27:07] <logmsgbot>	 !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@a637583]: Test deployment for recent updater build and GUI changes. Also blazegraph updates(wdqs1009)
[17:27:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:29] <wikibugs>	 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Ban spam arriving to my tools email - https://phabricator.wikimedia.org/T202558 (10GTirloni) The amount of spam seems to have reduced significantly in the last couple of days.  Emails blocked:  ```    2457 Blocked by DNSBL...
[17:28:32] <wikibugs>	 (PS1) Jdlrobson: Enable Page issues A/B test set rate to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792)
[17:28:34] <wikibugs>	 (PS1) Jdlrobson: Minerv page issues A/B test to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792)
[17:28:53] <logmsgbot>	 !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@a637583]: Test deployment for recent updater build and GUI changes. Also blazegraph updates(wdqs1009) (duration: 01m 46s)
[17:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:51] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.18 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:31:04] <wikibugs>	 (PS1) Jdlrobson: Page issues A/B test to 20% of users (Start the a/b test!) [mediawiki-config] - https://gerrit.wikimedia.org/r/463807 (https://phabricator.wikimedia.org/T200792)
[17:39:36] <wikibugs>	 (PS1) Elukey: matomo: replace last piwik occurrences [puppet] - https://gerrit.wikimedia.org/r/463810 (https://phabricator.wikimedia.org/T202962)
[17:40:08] <wikibugs>	 (CR) Elukey: [C: 2] matomo: replace last piwik occurrences [puppet] - https://gerrit.wikimedia.org/r/463810 (https://phabricator.wikimedia.org/T202962) (owner: Elukey)
[17:57:24] <XioNoX>	 !log push fw change on pfw3-codfw - T205888
[17:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:28] <stashbot>	 T205888: deploy PFW policy commit 99eb6f026 - https://phabricator.wikimedia.org/T205888
[17:59:26] <XioNoX>	 !log push fw change on pfw3-eqiad - T205888
[17:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:04] <jouncebot>	 Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1800)
[18:00:04] <jouncebot>	 stephanebisson, kostajh, Amir1, and Jdlrobson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:16] <Amir1>	 o/ mine is not testable
[18:00:22] <Amir1>	 on maintenance script 
[18:00:30] <RoanKattouw>	 I'll do the SWAT today
[18:01:30] <RoanKattouw>	 jdlrobson: You around?
[18:02:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.91 seconds
[18:04:20] <wikibugs>	 Operations, SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (Dzahn) The current "custom policy" on S4 says this:  - allow administrators - allow members of any project: acl*sre-team, acl*procurement-review - allow users: @RobH   I assum...
[18:05:56] <jdlrobson>	 \o
[18:06:10] <jdlrobson>	 RoanKattouw: i need to do 1 of those swats earlier in the window and the other later
[18:06:18] <RoanKattouw>	 Yup I saw
[18:06:20] <jdlrobson>	 to give me some time to check the grafana graphs in between
[18:06:21] <RoanKattouw>	 I'll do the first one right now
[18:06:23] <jdlrobson>	 cool! thank you :)
[18:06:30] <wikibugs>	 (CR) Catrope: [C: 2] Enable Page issues A/B test set rate to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson)
[18:07:22] <kostajh>	 RoanKattouw: I'm here
[18:08:04] <wikibugs>	 (Merged) jenkins-bot: Enable Page issues A/B test set rate to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson)
[18:09:32] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 88.87 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[18:10:13] <RoanKattouw>	 jdlrobson: First patch is on mwdebug2001, not sure if there's any testing you can do there, but if there is please do it
[18:10:26] <jdlrobson>	 RoanKattouw: will do won't take long
[18:11:10] <jdlrobson>	 ok we're good to go! please merge RoanKattouw  :)
[18:11:25] <RoanKattouw>	 kostajh: Yours is on mwdebug2001 (AKA mw2017 in the tool)
[18:11:37] <kostajh>	 RoanKattouw: thanks, will test now
[18:12:24] <RoanKattouw>	 kostajh: While you're at it, could you test https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/463680 too?
[18:12:53] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable page issues A/B test at 5% rate (T200792) (duration: 00m 59s)
[18:12:55] <kostajh>	 RoanKattouw: sure
[18:12:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:58] <stashbot>	 T200792: Run A/B test on page issues (Farsi, Japanese, Russian, English) - https://phabricator.wikimedia.org/T200792
[18:15:34] <kostajh>	 RoanKattouw: both look good to me
[18:16:12] <wikibugs>	 (CR) jenkins-bot: Enable Page issues A/B test set rate to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson)
[18:17:48] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/PageTriage/: Ensure valid AFC option is selected (T205324, T205168); hide copyvio behind a global var and URL param (duration: 00m 57s)
[18:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:54] <stashbot>	 T205168: AfC:  clicking 'Set filters' link will automatically change the sorting option to 'Created date (newest)'  - https://phabricator.wikimedia.org/T205168
[18:17:55] <stashbot>	 T205324: [wmf.22] AfC - Sort option 'Created date(newest)' displayed for default/reload - https://phabricator.wikimedia.org/T205324
[18:18:00] <jdlrobson>	 RoanKattouw: will be back from the graphs in 20 mins :)
[18:21:06] <Amir1>	 RoanKattouw: just to confirm, is my patch deployed?
[18:22:23] <RoanKattouw>	 Not yet, it was only just merged
[18:23:02] <Amir1>	 okay, noted
[18:23:05] <RoanKattouw>	 Deploying now
[18:23:52] <Amir1>	 Thanks. Once you're done with SWAT please take a look at https://phabricator.wikimedia.org/T205904, we might SWAT it if there will be time (Maybe evening SWAT)
[18:23:56] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.32.0-wmf.23/maintenance/includes/DeleteLocalPasswords.php: T201009 (duration: 00m 56s)
[18:23:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:01] <stashbot>	 T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009
[18:27:47] <wikibugs>	 (PS2) Jdlrobson: Minerv page issues A/B test to 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792)
[18:28:27] <Amir1>	 !log ladsgroup@mwmaint2001:~$ mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --wiki=enwiki --prefix (T201009)
[18:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:32] <Amir1>	 I'm monitoring s1
[18:29:47] <wikibugs>	 (CR) Ppchelko: [C: 1] "A related patch for change-prop https://github.com/wikimedia/change-propagation/pull/292" [mediawiki-config] - https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: Mobrovac)
[18:30:34] <jdlrobson>	 RoanKattouw: ready when you are for the other patch
[18:30:44] <jdlrobson>	 it's looking good
[18:30:59] <wikibugs>	 (CR) Catrope: [C: 2] Minerv page issues A/B test to 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson)
[18:32:02] <wikibugs>	 (Merged) jenkins-bot: Minerv page issues A/B test to 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson)
[18:33:40] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable page issues A/B test at 20% rate (T200792) (duration: 00m 56s)
[18:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:45] <stashbot>	 T200792: Run A/B test on page issues (Farsi, Japanese, Russian, English) - https://phabricator.wikimedia.org/T200792
[18:33:49] <RoanKattouw>	 jdlrobson: ---^^
[18:33:51] <RoanKattouw>	 And that's the SWAT done
[18:36:02] <jdlrobson>	 thanks RoanKattouw  :)
[18:36:20] <jdlrobson>	 waiting for the graphs to quadruple..
[18:36:35] <wikibugs>	 (PS1) Volans: Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184)
[18:37:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:38:09] <wikibugs>	 (PS2) Volans: Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184)
[18:39:03] <wikibugs>	 (CR) jenkins-bot: Minerv page issues A/B test to 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson)
[18:39:37] <wikibugs>	 (PS3) Volans: Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184)
[18:40:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:41:11] <volans>	 sorry for the spam above, I'm blind today
[18:41:14] <wikibugs>	 (PS4) Volans: Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184)
[18:44:44] <wikibugs>	 (CR) Volans: "Compiler results available here:" [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:47:50] <wikibugs>	 (CR) Dzahn: [C: 1] "looks good, since we are not using the $netbox_media variable in the bacula::director class though, let's add a warning comment that if th" [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:49:08] <wikibugs>	 (PS5) Volans: Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184)
[18:49:19] <wikibugs>	 (CR) Volans: "> Patch Set 4: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:49:40] <wikibugs>	 (CR) Dzahn: [C: 1] Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:50:25] <wikibugs>	 Operations, SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (herron) p:Triage>Normal
[18:53:06] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (thcipriani)
[18:54:53] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (Pchelolo)
[18:55:17] <wikibugs>	 Operations, DBA, JADE, MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) @Marostegui I'm not sure if this helps, but I'll try to better illustrate my question us...
[18:58:30] <wikibugs>	 (CR) Ayounsi: [C: 2] Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:58:39] <wikibugs>	 (CR) Ayounsi: [C: 1] Netbox: set media directory [puppet] - https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: Volans)
[18:59:11] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (Pchelolo) There is a base set of npm packages that are used by all services. Currently, server.js installs heap...
[18:59:55] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (Pchelolo)
[19:11:42] <thcipriani>	 !log restarting ci jenkins for new plugins
[19:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:11] <_joe_>	 jouncebot: next
[19:19:11] <jouncebot>	 In 0 hour(s) and 40 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2000)
[19:20:42] <wikibugs>	 (PS1) Herron: admin: create new group wdqs-roots [puppet] - https://gerrit.wikimedia.org/r/463835 (https://phabricator.wikimedia.org/T205543)
[19:20:44] <wikibugs>	 (PS1) Herron: admin: add onimisionipe to group wdqs-roots [puppet] - https://gerrit.wikimedia.org/r/463836 (https://phabricator.wikimedia.org/T205543)
[19:20:46] <wikibugs>	 (PS1) Herron: wdqs: add wdqs-roots group to wdqs common role [puppet] - https://gerrit.wikimedia.org/r/463837 (https://phabricator.wikimedia.org/T205543)
[19:21:53] <wikibugs>	 (PS2) Herron: admin: create new group wdqs-roots [puppet] - https://gerrit.wikimedia.org/r/463835 (https://phabricator.wikimedia.org/T205543)
[19:23:58] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@7caf4d8]: Content-negotiation filter going live T128040
[19:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:03] <stashbot>	 T128040: Document and implement the REST API format versioning and negotiation policy - https://phabricator.wikimedia.org/T128040
[19:27:36] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@7caf4d8]: Content-negotiation filter going live T128040 (duration: 03m 38s)
[19:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:44] <wikibugs>	 (CR) Andrew Bogott: [C: -1] "'os_compute_api:os-hypervisors' is not a safe thing to open to the public." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[19:50:48] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (thcipriani)
[19:51:40] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (thcipriani) p:Triage>Normal
[19:51:43] <logmsgbot>	 !log gehel@deploy1001 Started deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph (wdqs1009 only)
[19:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:13] <logmsgbot>	 !log gehel@deploy1001 Finished deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph (wdqs1009 only) (duration: 00m 30s)
[19:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:34] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team (Kanban), Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (thcipriani)
[19:53:28] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (thcipriani)
[19:53:32] <wikibugs>	 Operations, Release Pipeline, Epic, Release-Engineering-Team (Kanban), Services (watching): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (thcipriani)
[19:54:53] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (thcipriani)
[19:55:00] <wikibugs>	 Operations, Citoid, Services, Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (thcipriani)
[19:58:43] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (thcipriani)
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2000).
[20:00:14] <wikibugs>	 (PS1) Cwhite: icinga: move init script customizations to default [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782)
[20:00:19] <subbu>	 in ~15-20 mins
[20:00:35] <subbu>	 parsoid deploy, i mean. :)
[20:04:51] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:13:41] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:16:51] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 55.95 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:21:25] <logmsgbot>	 !log gehel@deploy1001 Started deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph
[20:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:22] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 54.2 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:32:01] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 70.36 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:35:25] <logmsgbot>	 !log gehel@deploy1001 Finished deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph (duration: 14m 00s)
[20:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:22] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 53.51 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:37:36] <logmsgbot>	 !log arlolra@deploy1001 Started deploy [parsoid/deploy@8ff45db]: Updating Parsoid to 224ecde
[20:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:13] <Amir1>	 godog: hey, when you have some time. It would be great if we can talk about logstash. I'm doing this: https://phabricator.wikimedia.org/T181630
[20:39:28] <Amir1>	 akosiaris: ^
[20:39:43] <Amir1>	 It's really hard to test it
[20:39:58] <wikibugs>	 (CR) Dzahn: "nice! yea, that's a nice way to keep the customizations. and a long-standing FIXME. some minor inline comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[20:44:07] <wikibugs>	 (PS2) Cwhite: icinga: move init script customizations to default [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782)
[20:45:02] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.83 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:45:58] <logmsgbot>	 !log arlolra@deploy1001 Finished deploy [parsoid/deploy@8ff45db]: Updating Parsoid to 224ecde (duration: 08m 22s)
[20:45:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:32] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Goal, Release-Engineering-Team (Watching / External), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (dduvall)
[20:55:03] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Goal, Release-Engineering-Team (Kanban), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (dduvall) a:dduvall
[20:56:47] <_joe_>	 arlolra, subbu please let me know if you think I should rework that patch, and in what direction; If it's not ready by tomorrow, I will revert parsoid to use HTTP for now; but I'd really like to get some version of it out before the end of the week, so that we can make the switchover more seamless
[20:57:20] <subbu>	 _joe_, we'll get it deployed tomorrow.
[20:57:36] <subbu>	 we just ran out of time for today since we were doing a sensitive deploy.
[20:57:46] <_joe_>	 yeah, don't worry :)
[20:59:38] <wikibugs>	 Operations, Datacenter-Switchover-2018, Discovery-Search (Current work): Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (EBernhardson)
[21:00:04] <jouncebot>	 bawolff and Reedy: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2100).
[21:02:37] <wikibugs>	 Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (ayounsi) a:ayounsi
[21:02:54] <wikibugs>	 Operations: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (ayounsi) a:ayounsi
[21:05:33] <wikibugs>	 (CR) Dzahn: icinga: move init script customizations to default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[21:06:27] <wikibugs>	 (CR) Dzahn: [C: 1] icinga: move init script customizations to default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[21:07:10] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Goal, Release-Engineering-Team (Kanban), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (dduvall) As of this morning both Jenkins master have the Prometheus plugin installed and enabled. The...
[21:08:13] <wikibugs>	 (CR) Dzahn: [C: 1] "i looked at the diff between the init script from the stretch distro package and the current custom file in the puppet repo. these are the" [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[21:17:06] <arlolra>	 !log Updated Parsoid to 224ecde (T198504, T133673, T202666)
[21:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:14] <stashbot>	 T198504: [BUG] Audio file shows incomplete toolbar in an article page - https://phabricator.wikimedia.org/T198504
[21:17:15] <stashbot>	 T202666: html2wt endpoint should handle mismatching content versions - https://phabricator.wikimedia.org/T202666
[21:17:16] <stashbot>	 T133673: Add width/height attributes to the <audio><video> tag - https://phabricator.wikimedia.org/T133673
[21:17:50] <wikibugs>	 Operations, netops, Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) The logs mentioned during the meeting seem to be the link between a2 and a8 flapping (possibly faulty optic) and VC members re-calculating paths around th...
[21:21:51] <wikibugs>	 Operations, ops-requests: neon ran out of inodes again, needs cron job to cleanup - https://phabricator.wikimedia.org/T82651 (Dzahn)
[21:22:48] <wikibugs>	 Operations, ops-requests: Investigate Puppet Freshness Problems - https://phabricator.wikimedia.org/T82696 (Dzahn)
[21:23:18] <mutante>	 ^ these are ancient things, you can ignore them. i am just making things public that can be and came from RT
[21:23:40] <mutante>	 because i ran into it searching for old icinga related things that we still care about today
[21:24:19] <wikibugs>	 (PS3) Herron: WIP: smarthost: create mail smarthost role/profile [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785)
[21:25:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: smarthost: create mail smarthost role/profile [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: Herron)
[21:51:31] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 57.44 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:54:10] <wikibugs>	 (PS1) Jgreen: enable nsca collection for frack check_timesync metric [puppet] - https://gerrit.wikimedia.org/r/463854
[21:55:52] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.75 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:56:23] <wikibugs>	 (CR) Jgreen: [C: 2] enable nsca collection for frack check_timesync metric [puppet] - https://gerrit.wikimedia.org/r/463854 (owner: Jgreen)
[21:56:58] <wikibugs>	 Operations, Community-Tech, MediaWiki-Parser, Traffic: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (Niharika) p:Triage>Normal
[21:57:25] <wikibugs>	 Operations, Puppet, User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (Banyek) I did some debug, and found solution for both of the problems:  //**wmf-pt-kill not killing queries:**//  - The config file contains the MATCH_USER and MATCH...
[22:23:27] <wikibugs>	 Operations, fundraising-tech-ops, netops: deploy PFW policy commit 99eb6f026 - https://phabricator.wikimedia.org/T205888 (Jgreen) Open>Resolved
[22:29:16] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@babfe80]: Don't log the request for transform failures
[22:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:47] <wikibugs>	 (PS3) Dzahn: icinga::naggen/web/raid/ores: avoid out-of-scope-vars everywhere [puppet] - https://gerrit.wikimedia.org/r/463404
[22:32:03] <wikibugs>	 (PS3) Cwhite: icinga: move init script customizations to default [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782)
[22:37:13] <wikibugs>	 (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12701/" [puppet] - https://gerrit.wikimedia.org/r/463404 (owner: Dzahn)
[22:38:53] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Services (watching): TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (Pchelolo)
[22:41:11] <wikibugs>	 Operations, ops-eqiad: neon has bad memory (or disk?) - https://phabricator.wikimedia.org/T82983 (Dzahn)
[22:41:40] <wikibugs>	 Operations, ops-eqiad: Neon sdb is failing - https://phabricator.wikimedia.org/T82614 (Dzahn)
[22:41:42] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@babfe80]: Don't log the request for transform failures (duration: 12m 27s)
[22:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:53] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@babfe80]: Don't log the request for transform failures, take 2, feeds check timeouts
[22:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:49] <wikibugs>	 (CR) Dzahn: [C: 2] "noop on einsteinium:  https://puppet-compiler.wmflabs.org/compiler1002/12703/" [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[22:43:56] <wikibugs>	 (PS4) Dzahn: icinga: move init script customizations to default [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[22:45:50] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@babfe80]: Don't log the request for transform failures, take 2, feeds check timeouts (duration: 03m 57s)
[22:45:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:33] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@babfe80]: Don't log the request for transform failures, take 3, feeds check timeouts
[22:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:18] <wikibugs>	 (CR) Dzahn: [C: 1] admin: create new group wdqs-roots [puppet] - https://gerrit.wikimedia.org/r/463835 (https://phabricator.wikimedia.org/T205543) (owner: Herron)
[22:47:31] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:47:37] <wikibugs>	 (CR) Dzahn: [C: 1] wdqs: add wdqs-roots group to wdqs common role [puppet] - https://gerrit.wikimedia.org/r/463837 (https://phabricator.wikimedia.org/T205543) (owner: Herron)
[22:48:04] <wikibugs>	 (CR) Dzahn: [C: 1] admin: add onimisionipe to group wdqs-roots [puppet] - https://gerrit.wikimedia.org/r/463836 (https://phabricator.wikimedia.org/T205543) (owner: Herron)
[22:48:18] <wikibugs>	 (PS1) QChris: Allow “Gerrit Managers” to import history [software/thumbor-plugins] (refs/meta/config) - https://gerrit.wikimedia.org/r/463865
[22:48:20] <wikibugs>	 (CR) QChris: [V: 2 C: 2] Allow “Gerrit Managers” to import history [software/thumbor-plugins] (refs/meta/config) - https://gerrit.wikimedia.org/r/463865 (owner: QChris)
[22:48:47] <wikibugs>	 (CR) Dzahn: [C: 2] "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Invalid relationship: Service[icinga] { require =" [puppet] - https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[22:49:28] <wikibugs>	 (CR) Dzahn: [C: 1] "approved in SRE meeting today" [puppet] - https://gerrit.wikimedia.org/r/463836 (https://phabricator.wikimedia.org/T205543) (owner: Herron)
[22:51:16] <wikibugs>	 (CR) Catrope: [C: -2] "Blocked on wmf.23 being deployed on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: Sbisson)
[22:52:55] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@babfe80]: Don't log the request for transform failures, take 3, feeds check timeouts (duration: 06m 22s)
[22:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:21] <wikibugs>	 (PS1) QChris: Import done. Revoke import grants [software/thumbor-plugins] (refs/meta/config) - https://gerrit.wikimedia.org/r/463867
[22:54:23] <wikibugs>	 (CR) QChris: [V: 2 C: 2] Import done. Revoke import grants [software/thumbor-plugins] (refs/meta/config) - https://gerrit.wikimedia.org/r/463867 (owner: QChris)
[22:59:02] <wikibugs>	 (PS1) Cwhite: icinga: service management depends on debian version [puppet] - https://gerrit.wikimedia.org/r/463868 (https://phabricator.wikimedia.org/T202782)
[23:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2300).
[23:00:05] <jouncebot>	 davidwbarratt: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:17] <davidwbarratt>	 I'm here!
[23:02:02] <davidwbarratt>	 who is SWATing?
[23:06:18] <davidwbarratt>	 hello?
[23:08:29] <dmaza>	 hello :p
[23:10:44] <davidwbarratt>	 ping addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof:
[23:11:00] <MaxSem>	 I'm in a meeting, sorry
[23:11:03] <twentyafterfour>	 I can swat
[23:11:06] <davidwbarratt>	 no problem
[23:11:09] <wikibugs>	 Operations, ops-ulsfo, netops: Interface errors on cr4-ulsfo:et-0/0/1 - https://phabricator.wikimedia.org/T205937 (ayounsi) p:Triage>Normal
[23:11:11] <davidwbarratt>	 twentyafterfour thanks!
[23:11:27] <davidwbarratt>	 twentyafterfour it's a SQL patch to create a table
[23:12:04] <twentyafterfour>	 have we gotten any dba review of the change? 
[23:12:17] <davidwbarratt>	 yes, it's already merged into master
[23:12:51] <davidwbarratt>	 here's the task https://phabricator.wikimedia.org/T197144 and the DBA review(s) https://phabricator.wikimedia.org/T193449
[23:12:59] <davidwbarratt>	 but it is not a "schema change"
[23:13:17] <davidwbarratt>	 https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change
[23:13:30] <twentyafterfour>	 ah I see
[23:13:43] <twentyafterfour>	 ok just creating a table should be fine, indeed 
[23:14:35] <RoanKattouw>	 Looks to me like you should run foreachwiki sql.php maintenance/postgres/archives/patch-ipblocks_restrictions-table.sql  , right?
[23:14:44] <RoanKattouw>	 Uh, except without the /postgres/ bit
[23:14:51] <wikibugs>	 (CR) Cwhite: [C: 2] "NOOP on existing: https://puppet-compiler.wmflabs.org/compiler1001/12705/" [puppet] - https://gerrit.wikimedia.org/r/463868 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[23:14:52] <davidwbarratt>	 yep, just this: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/440871/21/maintenance/archives/patch-ipblocks_restrictions-table.sql
[23:15:36] <twentyafterfour>	 ok do we need to cherry pick anything to the branch or was this already merged prior to branch cut?
[23:15:43] * twentyafterfour looks at date of merge
[23:15:54] <RoanKattouw>	 "Included in" says wmf.22 and wmf.23
[23:16:00] <davidwbarratt>	 it's already merged, it should be deployed already
[23:16:05] <twentyafterfour>	 yeah cool 
[23:16:13] <twentyafterfour>	 so just the foreachwiki should do it 
[23:16:34] <Reedy>	 gonna have to be run from a codfw mtx host though, not deploy1001
[23:16:51] <RoanKattouw>	 What is the codfw maintenance host? mwmaint2001?
[23:16:55] <Reedy>	 yeah
[23:17:06] <Reedy>	 codfw is the active, but deployment still is in eqiad
[23:17:06] <RoanKattouw>	 Also, for completeness, let me correct my mistake:  foreachwiki sql.php maintenance/archives/patch-ipblocks_restrictions-table.sql
[23:17:11] <twentyafterfour>	 not sure if I have access to that? 
[23:17:14] * twentyafterfour checks
[23:17:18] <Reedy>	 you will :)
[23:17:20] <RoanKattouw>	 OK cool. I've never been on mwmaint2001, only on mwmaint1001
[23:17:32] <Reedy>	 mwmaint2001.codfw.wmnet obvs
[23:18:12] <davidwbarratt>	 I assume that creating the tables will persist after the datacenter switch-backover ?
[23:18:21] <twentyafterfour>	 !log creating ipblocks_restrictions table (command run on mwmaint2001: foreachwiki sql.php maintenance/archives/patch-ipblocks_restrictions-table.sql)
[23:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:31] <twentyafterfour>	 davidwbarratt: it should
[23:18:42] <davidwbarratt>	 good
[23:18:48] <Reedy>	 we're still replicating across DC
[23:18:53] <davidwbarratt>	 ah
[23:18:54] <Reedy>	 just the masters are currently in codfw
[23:19:06] <Reedy>	 https://dbtree.wikimedia.org/ is a nice illustration of it
[23:19:35] <James_F>	 RoanKattouw: Welcome, newbie. ;-)
[23:19:50] <twentyafterfour>	 ok the queries are running
[23:20:00] <davidwbarratt>	 yay!
[23:20:09] <twentyafterfour>	 so far all say " Query OK, 0 row(s) affected"
[23:20:30] <Reedy>	 That's normal IIRC
[23:20:44] <twentyafterfour>	 is there an easy way to confirm that the table is created?
[23:21:01] <Reedy>	 sql aawiki
[23:21:07] <Reedy>	 explain ipblocks_restrictions;
[23:21:16] <Reedy>	 I've just confirmed it's there on the eqiad hosts
[23:21:40] <twentyafterfour>	 Cool, Thanks Reedy!
[23:21:46] <James_F>	 Table exists on codfw aawiki too.
[23:21:54] <twentyafterfour>	 still running, there are a lot of wikis ;) 
[23:21:56] <Reedy>	 James_F: I'd be amazed if they didn't :P
[23:22:03] <Reedy>	 That would be some witchcraft
[23:22:36] <twentyafterfour>	 replag == witchcraft
[23:22:48] <James_F>	 Reedy: I've seen some stuff you wouldn't believe. RepLag, C-beams glittering off the belt of Orion, etc.
[23:22:54] <James_F>	 Bah, twentyafterfour got there first.
[23:23:02] <twentyafterfour>	 hah 
[23:23:02] <Reedy>	 Are you planning on deploying code that uses that table soon after?
[23:23:30] <davidwbarratt>	 we can't until the schema change is complete (i.e. the patch that changes a column)
[23:24:00] <davidwbarratt>	 https://phabricator.wikimedia.org/T204006
[23:24:34] <Reedy>	 fair
[23:27:50] <icinga-wm>	 PROBLEM - puppet last run on wdqs1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup]
[23:30:01] <davidwbarratt>	 twentyafterfour still running?
[23:30:21] <Reedy>	 probably...  over 900 wikis takes a while :)
[23:30:39] <twentyafterfour>	 davidwbarratt: yeah almost done I think 
[23:30:57] <twentyafterfour>	 at zawiki
[23:31:06] <twentyafterfour>	 and now it's done
[23:31:16] <mutante>	 herron: is the wdqs error about the new group?
[23:31:24] <twentyafterfour>	 !log finished creating database tables
[23:31:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:48] <wikibugs>	 (Abandoned) MacFan4000: Remove MW 1.29 from ExtDist as it is now no longer supported [mediawiki-config] - https://gerrit.wikimedia.org/r/440745 (owner: MacFan4000)
[23:33:25] <mutante>	 herron: ignore it. it is definitely not. i see an unrelated issue
[23:34:05] <herron>	 mutante ok, I didn’t merge that yet fwiw
[23:34:26] <mutante>	 herron: yea, i see it. it just seemed like it because "groups-cleanup" 
[23:34:37] <davidwbarratt>	 YAY!
[23:34:46] <davidwbarratt>	 thanks twentyafterfour !
[23:34:51] <mutante>	 but the issue is that somebody manually added a user or installed software
[23:35:06] <twentyafterfour>	 davidwbarratt: You're welcome. Thanks to Reedy and RoanKattouw for helping out. 
[23:35:28] <herron>	 mutante ah! gotcha
[23:35:48] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Shell, Core Platform Team Kanban (Later): Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (CCicalese_WMF)
[23:36:24] <mutante>	 SMalyshev: there is a puppet problem on wdqs1010. was a user "virtuoso" created manually by any chance?
[23:43:18] <foks>	 !log disabling 2FA for two users
[23:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log