[01:38:43] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:04:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:18:22] PROBLEM - HHVM rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:22] RECOVERY - HHVM rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 80626 bytes in 0.157 second response time
[02:47:03] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:02:22] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:12:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:18:32] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:21:22] !log restarting inplace reindexing of enwiki and viwiki at codfw - T204362
[03:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:28] T204362: Resolve elasticsearch shard size alert by doing an in place reindex - https://phabricator.wikimedia.org/T204362
[03:28:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 801.77 seconds
[04:08:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.29 seconds
[05:07:23] !log
Deploy schema change on s1 codfw master - T203709
[05:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:29] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[05:12:54] (PS1) Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699
[05:13:54] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:15:14] (PS2) Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699
[05:15:48] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@07cbfb4]: Update mobileapps to a1fa41b
[05:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:51] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:17:59] (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:19:06] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@07cbfb4]: Update mobileapps to a1fa41b (duration: 03m 18s)
[05:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 (duration: 01m 01s)
[05:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:37] !log Stop replication on dbstore1002 and db1103:3312 in sync
[05:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:11] (CR) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/463699 (owner: Marostegui)
[05:30:16] (CR) Marostegui: [C: 1] mariadb: Setup db1116 for backup generation on eqiad of
s7 and s8 [puppet] - https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[05:30:51] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701
[05:32:09] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701 (owner: Marostegui)
[05:33:11] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701 (owner: Marostegui)
[05:35:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3312 (duration: 00m 56s)
[05:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:19] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/463701 (owner: Marostegui)
[05:43:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[05:53:01] (PS4) Giuseppe Lavagetto: profile::lvs::realserver: new profile to substitute the role [puppet] - https://gerrit.wikimedia.org/r/463488
[05:53:42] (CR) Giuseppe Lavagetto: [C: 2] profile::lvs::realserver: new profile to substitute the role [puppet] - https://gerrit.wikimedia.org/r/463488 (owner: Giuseppe Lavagetto)
[05:54:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:00:18] (PS4) Giuseppe Lavagetto: parsoid: convert to role/profile [puppet] - https://gerrit.wikimedia.org/r/463489
[06:01:09] (CR) Giuseppe Lavagetto: [C: 2] parsoid: convert to role/profile [puppet] -
https://gerrit.wikimedia.org/r/463489 (owner: Giuseppe Lavagetto)
[06:14:17] (PS1) Giuseppe Lavagetto: profile::lvs::realserver: no conftool in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/463704
[06:15:31] (CR) Giuseppe Lavagetto: [C: 2] profile::lvs::realserver: no conftool in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/463704 (owner: Giuseppe Lavagetto)
[06:18:02] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:20:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:32:43] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/phaste]
[06:32:52] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/pybal-check]
[06:33:49] Operations: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 (Joe)
[06:33:53] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_established_connections]
[06:39:12] !log mholloway-shell@deploy1001 Started deploy [tilerator/deploy@22f90ee] (maps1004): Update tilerator to latest (T205462)
[06:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:19] T205462: Load OSM data into maps1004 - https://phabricator.wikimedia.org/T205462
[06:39:31] !log mholloway-shell@deploy1001 Finished deploy [tilerator/deploy@22f90ee] (maps1004): Update tilerator to latest (T205462) (duration: 00m 19s)
[06:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:12] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:12] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:59:58] !log mholloway-shell@deploy1001 Started deploy [kartotherian/deploy@ab6cb74] (maps1004): Update kartotherian to latest (T205462)
[07:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:03] T205462: Load OSM data into maps1004 - https://phabricator.wikimedia.org/T205462
[07:00:15] !log mholloway-shell@deploy1001 Finished deploy [kartotherian/deploy@ab6cb74] (maps1004): Update kartotherian to latest (T205462) (duration: 00m 16s)
[07:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 34 probes of 315 (alerts on 25) -
https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:10:12] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:14:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:19:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 28 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:24:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 22 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:28:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 50 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:30:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:31:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:36:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:38:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:41:54] Operations, Wikimedia-Mailing-lists, Patch-For-Review: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750 (valerio.bozzolan) Apparently yes.
[07:44:01] (PS3) Giuseppe Lavagetto: parsoid: connect to MediaWiki via https everywhere [puppet] - https://gerrit.wikimedia.org/r/458475
[07:44:32] https://phabricator.wikimedia.org/T205636 this is a very serious issue
[07:45:22] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:34] (CR) Giuseppe Lavagetto: [C: 2] parsoid: connect to MediaWiki via https everywhere [puppet] - https://gerrit.wikimedia.org/r/458475 (owner: Giuseppe Lavagetto)
[07:52:20] (PS2) Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907)
[07:52:22] (PS1) Alexandros Kosiaris: scap: Replace an ugly hack with puppet 4 syntax [puppet] - https://gerrit.wikimedia.org/r/463708 (https://phabricator.wikimedia.org/T204907)
[07:52:24] (PS1) Alexandros Kosiaris: WIP: scap: Move prefix from confd to key creation [puppet] - https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907)
[07:54:03] PROBLEM - confd service on wtp1045 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[07:54:26] !log disabling puppet on labsdb1009, labsdb1010, labsdb1011
[07:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:51]
!log disabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983)
[07:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:55] T183983: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983
[07:55:16] <_joe_> the confd thing on wtp1045 is me
[07:55:41] (PS1) Giuseppe Lavagetto: parsoid: fixup for the https uri [puppet] - https://gerrit.wikimedia.org/r/463710
[07:57:30] (CR) Giuseppe Lavagetto: [C: 2] parsoid: fixup for the https uri [puppet] - https://gerrit.wikimedia.org/r/463710 (owner: Giuseppe Lavagetto)
[07:58:20] (CR) Banyek: [C: 2] wikireplicas: Config template generation for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:10:42] (PS30) Banyek: wikireplicas: Config template generation for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983)
[08:10:48] (CR) Banyek: [V: 2 C: 2] wikireplicas: Config template generation for wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:14:00] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 33 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:18:40] PROBLEM - Check systemd state on labsdb1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:19:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:22:46] (PS2) Elukey: Add an-coord1001 to puppet [puppet] - https://gerrit.wikimedia.org/r/463297 (https://phabricator.wikimedia.org/T204970)
[08:24:19] (CR) Elukey: [C: 2] Add an-coord1001 to puppet [puppet] - https://gerrit.wikimedia.org/r/463297 (https://phabricator.wikimedia.org/T204970) (owner: Elukey)
[08:28:44] Operations, ops-eqiad, Analytics, Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can be found in `/var...
[08:32:50] arturo: I am around now, what's up?
[08:39:59] (PS1) ArielGlenn: dumps: set up a minimal config file for 'other' dumps [puppet] - https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825)
[08:40:31] (CR) jerkins-bot: [V: -1] dumps: set up a minimal config file for 'other' dumps [puppet] - https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825) (owner: ArielGlenn)
[08:40:53] Operations, netops: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (Marostegui)
[08:41:18] (PS1) Banyek: wikireplicas: quickfix for wmf-pt-kill template config [puppet] - https://gerrit.wikimedia.org/r/463712 (https://phabricator.wikimedia.org/T183983)
[08:42:25] (CR) Marostegui: [C: 1] wikireplicas: quickfix for wmf-pt-kill template config [puppet] - https://gerrit.wikimedia.org/r/463712 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:42:52] (CR) Banyek: [C: 2] wikireplicas: quickfix for wmf-pt-kill template config [puppet] -
https://gerrit.wikimedia.org/r/463712 (https://phabricator.wikimedia.org/T183983) (owner: Banyek)
[08:48:56] RECOVERY - Check systemd state on labsdb1010 is OK: OK - running: The system is fully operational
[08:50:21] (PS2) ArielGlenn: dumps: set up a minimal config file for 'other' dumps [puppet] - https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825)
[08:54:04] <_joe_> !log rolling restart of parsoid in eqiad
[08:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:12] Operations, ops-eqiad, Analytics, Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['an-coord1001.eqiad.wmnet'] ```
[08:55:46] Operations, ops-eqiad, Analytics, Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (elukey) So attempting to install the OS leads to: ``` iDRAC Settings: CBL0009: Backplane 1 connector A0 is not connected. CBL0009: Backplane 1 connector B0 is not c...
[08:56:38] <_joe_> !log rolling restart of parsoid in codfw; afterwards, parsoid will connect to the MediaWiki API via HTTPS
[08:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:46] (PS1) Jcrespo: mariadb: Depool db1092, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392)
[09:09:38] (PS2) Jcrespo: mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392)
[09:09:58] (CR) Marostegui: "Do you mean db1092 (what the commit says) or db1093 (what the patch does)?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:10:11] (CR) Marostegui: [C: 1] mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:10:49] (CR) Marostegui: [C: -1] "Also depool db1093 from API" [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:15:15] Operations, Recommendation-API, Research, SCB, Services (next): Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (fgiunchedi) >>! In T205452#4625109, @bmansurov wrote: > Friendly ping @Joe @fgiunchedi. Can you please help with this task? Thanks! Looks like @jcre...
[09:15:51] !log Set Racktables in read-only mode - T199083
[09:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:57] T199083: Migrate the hardware inventory from Racktables to Netbox - https://phabricator.wikimedia.org/T199083
[09:18:58] (PS3) Jcrespo: mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392)
[09:22:43] Operations, Patch-For-Review: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (fgiunchedi) >>! In T205526#4624463, @MoritzMuehlenhoff wrote: > Can you please add the nickserv password to pwstore? I've added the password to `private.git`, is it needed inside pwstore too in thi...
[09:25:26] (CR) Marostegui: [C: 1] mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:33:12] !log test formatting sdh and sdi on ms-be2040 with crc=0 - T199198
[09:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:17] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[09:33:34] (PS9) Petar.petkovic: Remove unused default source language config for CX [mediawiki-config] - https://gerrit.wikimedia.org/r/460492
[09:40:57] Operations, media-storage, Patch-For-Review, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (fgiunchedi) All filesystems on ms-be1040 have been reformatted with `crc=0` and almost all have finished filling up again with data. I ha...
[10:03:41] godog: I was playing with the prometheus-openstack-exporter the other day
[10:03:48] I plan to follow up today
[10:04:07] (PS37) Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885)
[10:04:09] (PS8) Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885)
[10:04:40] https://phabricator.wikimedia.org/T205636 this is a very serious issue
[10:04:47] I know IE is evil, but...
[10:05:21] (CR) jerkins-bot: [V: -1] Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[10:05:23] (CR) jerkins-bot: [V: -1] elasticsearch_cluster: adding new features [software/spicerack] - https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[10:07:01] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[10:07:49] yannf: https://phabricator.wikimedia.org/T205636#4626267 says will be fixed on Monday
[10:08:05] so I am guessing when the deployment happens it will be resolved
[10:10:07] (CR) Alexandros Kosiaris: [C: 2] hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[10:10:38] !log mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --wiki=fawiki --delete (T201009)
[10:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:43] T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009
[10:12:22] akosiaris, ok, I am pinging here because it is still in "Needs Triage"
[10:14:19] arturo: sweet!
[10:15:26] !log ladsgroup@mwmaint2001:~$ mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --prefix on all CentralAuth wikis (T201009)
[10:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:38] godog: I guess the monitoring host is not labmon. to my understanding, labmon is only for things in cloudvps (i.e., virtual), and these openstack metrics are more 'physical', right?
[10:20:34] arturo: eh good question, there is prometheus on labmon and IMHO since openstack-exporter is used only in wmcs, metrics should live there
[10:20:58] in my mind the split is more along prod/wmcs lines rather than virtual/physical
[10:21:42] !log repair /dev/sdf1 /dev/sde1 on ms-be1041 - T199198
[10:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:47] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[10:30:06] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1030).
[10:30:27] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546)
[10:36:55] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:38:02] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:41:26] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:463723|Bumping portals to master (T128546)]] (duration: 00m 59s)
[10:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:30] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:42:23] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:463723|Bumping portals to master (T128546)]] (duration: 00m 56s)
[10:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:41] (CR) Mholloway: [C: 1] "Nice find. LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: MSantos)
[10:50:53] (CR) Jcrespo: [C: 2] mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[10:51:01] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/463723 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:51:10] (CR) jenkins-bot: mariadb: Depool db1093, db1064 to create backup instances [mediawiki-config] - https://gerrit.wikimedia.org/r/463715 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[10:51:16] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 483.80 seconds
[10:52:06] commons issues?
[10:52:36] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1093, db1064 (duration: 00m 57s)
[10:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:07] there are 2 slow, 1-hour-long queries on those hosts
[10:53:22] :(
[10:53:52] I am going to depool the host
[10:54:19] it is not the only slave delayed in s4
[10:55:36] Operations, Research, SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (Miriam) p:Triage>High
[10:55:43] not?
[10:55:46] (PS1) Jcrespo: mariadb: Emergency depool db2058, issues? [mediawiki-config] - https://gerrit.wikimedia.org/r/463725
[10:55:51] then there is another commons issue
[10:56:47] yep, we have major commons issues
[10:56:49] I was looking at tendril
[10:57:21] I am going to depool it anyway
[10:57:24] ack
[10:57:28] (CR) Jcrespo: [C: 2] mariadb: Emergency depool db2058, issues?
[mediawiki-config] - https://gerrit.wikimedia.org/r/463725 (owner: Jcrespo)
[10:58:15] volans: mediawiki works badly when one host is lagged
[10:58:23] I know
[10:58:57] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2058 (duration: 00m 57s)
[10:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:02] Check 'Logstash Error rate for mw2226.codfw.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.09, After: 2.00, Threshold: 1.00)
[11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1100).
[11:00:04] kart_ and Pl217: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:17] * kart_ is here
[11:00:46] o/
[11:00:50] I can SWAT today
[11:01:01] stop SWAT
[11:01:04] zeljkof: we're having some issues in the database cluster for commons (s4)
[11:01:05] we have a major outage
[11:01:27] jynus, volans: ok, did not even start yet
[11:01:56] jynus, volans: could things get back to normal during swat, or should I just cancel this window?
[11:02:25] zeljkof: too early to say at this moment, the issue started 10m ago
[11:02:38] volans: ok, swat on hold
[11:02:51] kart_: ^
[11:02:56] I'll keep you posted
[11:03:12] PROBLEM - MariaDB Slave Lag: s4 on db1097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.83 seconds
[11:03:14] it is not the servers
[11:03:15] zeljkof: yep. Ping me in case we are back to normal.
[11:03:16] volans: thanks!
[11:03:20] it is a commons maintenance
[11:03:21] kart_: will do
[11:03:29] which is creating the outage
[11:04:46] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.07 seconds
[11:05:37] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.54 seconds
[11:05:56] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.34 seconds
[11:06:05] Operations, Gadgets, MediaWiki-Cache, Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Conn yields updated to today: ``` mc2019.codfw.wmnet: STAT conn_yields 5650...
[11:06:08] (CR) jenkins-bot: mariadb: Emergency depool db2058, issues? [mediawiki-config] - https://gerrit.wikimedia.org/r/463725 (owner: Jcrespo)
[11:06:34] Amir1: that is you running the maintenance
[11:06:56] <_joe_> Amir1: please kill that script or I'll do it
[11:06:57] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:06:59] (CR) Vgutierrez: [C: 2] Change how DNS update commands work to handle problematic values [software/certcentral] - https://gerrit.wikimedia.org/r/459662 (owner: Alex Monk)
[11:07:16] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:08:12] RECOVERY - MariaDB Slave Lag: s4 on db1097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:08:36] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.05 seconds
[11:09:17] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:09:54] <_joe_> !log killed bash runner.sh by user ladsgroup on mwmaint2001
[11:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:30] _joe_: thanks, I was afk for
a bit
[11:10:43] <_joe_> Amir1: heh np, it was causing an outage
[11:10:44] Sorry for the trouble
[11:10:53] <_joe_> btw, why didn't you use foreachwikiindblist ?
[11:11:28] I will repool db2058
[11:11:34] it was not only it
[11:11:36] Because the group doesn't exist so I manually composed the list
[11:11:38] ack
[11:11:50] (CR) Mholloway: [C: -1] "Oops, one issue inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: MSantos)
[11:11:51] (my ack was to jaime)
[11:12:04] tgr|away: fyi ^
[11:12:45] Amir1: no danger for it to start automatically, right?
[11:12:55] No
[11:13:01] so I can unblock the swat
[11:13:08] Yup
[11:13:13] Sorry again
[11:13:20] but wait for my deployment first, zeljkof
[11:13:49] (PS1) Jcrespo: Revert "mariadb: Emergency depool db2058, issues?" [mediawiki-config] - https://gerrit.wikimedia.org/r/463728
[11:14:56] (CR) Jcrespo: [C: 2] Revert "mariadb: Emergency depool db2058, issues?" [mediawiki-config] - https://gerrit.wikimedia.org/r/463728 (owner: Jcrespo)
[11:15:13] (CR) Jcrespo: [V: 2 C: 2] Revert "mariadb: Emergency depool db2058, issues?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463728 (owner: 10Jcrespo) [11:16:41] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 552.46 seconds [11:16:48] jynus: sure, let me know when I can start swat [11:16:58] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2058 (duration: 00m 55s) [11:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:01] PROBLEM - MariaDB Slave Lag: s5 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 440.58 seconds [11:17:02] PROBLEM - MariaDB Slave Lag: s5 on db1110 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 441.13 seconds [11:17:11] PROBLEM - MariaDB Slave Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 445.59 seconds [11:17:21] PROBLEM - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 452.27 seconds [11:17:41] PROBLEM - MariaDB Slave Lag: s5 on db1097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 464.86 seconds [11:17:42] PROBLEM - MariaDB Slave Lag: s5 on db1100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 466.61 seconds [11:17:51] PROBLEM - MariaDB Slave Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 474.90 seconds [11:17:51] PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.97 seconds [11:17:52] PROBLEM - MariaDB Slave Lag: s5 on db1113 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 490.54 seconds [11:18:11] RECOVERY - MariaDB Slave Lag: s5 on db1110 is OK: OK slave_sql_lag Replication lag: 59.86 seconds [11:18:12] RECOVERY - MariaDB Slave Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [11:18:22] RECOVERY - MariaDB Slave Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.03 seconds [11:18:51] RECOVERY - MariaDB Slave Lag: s5 on db1097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:18:52] RECOVERY - 
MariaDB Slave Lag: s5 on db1100 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:18:52] RECOVERY - MariaDB Slave Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.16 seconds [11:19:01] RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [11:19:01] RECOVERY - MariaDB Slave Lag: s5 on db1113 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:19:11] RECOVERY - MariaDB Slave Lag: s5 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:20:35] zeljkof: you should be ok now, but you will have lots of errors on the log in the last 1h [11:20:55] (03CR) 10jenkins-bot: Revert "mariadb: Emergency depool db2058, issues?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463728 (owner: 10Jcrespo) [11:21:06] e.g. scap may complain [11:22:00] jynus: ok, I'll try, if scap is not happy, I'll abort [11:23:38] zeljkof: you can go first with config patch, if that is quicker thing to do. [11:26:31] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [11:27:12] Pl217: around for SWAT? [11:27:36] zeljkof: I'm taking care of that patch.. [11:27:40] kart_: looks like the other person is not around for SWAT, I'll review and merge your patch, it could take a while to merge :/ [11:27:50] kart_: ah, you're in charge of both patches? [11:28:05] zeljkof: yep. updated page. [11:28:12] I'll merge both, and deploy the config first, since extension will take a while to merge [11:28:19] OK! [11:29:54] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177) [11:31:09] kart_: there is no related task for 460492? 
[11:31:36] (just asking, it's not required) [11:32:07] but it's rare that there is no phab task, so checking if it's a mistake [11:33:00] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177) [11:33:53] zeljkof: it is fine. It was followed up from: https://phabricator.wikimedia.org/T202286 [11:34:02] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:34:20] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460492 (owner: 10Petar.petkovic) [11:34:40] (03PS1) 10Banyek: wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/463734 [11:34:41] PROBLEM - swift-account-server on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:34:46] zeljkof: commit message has all details about it. [11:34:51] PROBLEM - DPKG on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:34:51] PROBLEM - swift-account-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:34:51] PROBLEM - swift-object-server on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:34:51] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:01] PROBLEM - configured eth on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:01] PROBLEM - swift-object-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:02] PROBLEM - swift-container-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:35:11] PROBLEM - swift-account-reaper on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:12] PROBLEM - Disk space on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:21] PROBLEM - swift-object-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:21] PROBLEM - swift-object-updater on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:22] PROBLEM - swift-container-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:22] PROBLEM - Check size of conntrack table on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:33] kart_: ok [11:35:39] (03Merged) 10jenkins-bot: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460492 (owner: 10Petar.petkovic) [11:35:42] RECOVERY - swift-account-server on ms-be2040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:35:42] RECOVERY - swift-account-auditor on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:35:42] RECOVERY - DPKG on ms-be2040 is OK: All packages OK [11:35:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2040 is OK: OK ferm input default policy is set [11:35:51] RECOVERY - swift-object-server on ms-be2040 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:35:52] Now, what are these errors? 
:/ [11:36:00] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts [puppet] - 10https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177) [11:36:01] RECOVERY - configured eth on ms-be2040 is OK: OK - interfaces up [11:36:01] RECOVERY - swift-container-auditor on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:36:02] RECOVERY - swift-object-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:36:02] (03PS2) 10Banyek: wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983) [11:36:02] RECOVERY - swift-account-reaper on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:36:11] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational [11:36:11] RECOVERY - Disk space on ms-be2040 is OK: DISK OK [11:36:18] kart_: 460492 is at mwdebug2001 [11:36:21] RECOVERY - Check size of conntrack table on ms-be2040 is OK: OK: nf_conntrack is 2 % full [11:36:21] RECOVERY - swift-container-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:36:21] RECOVERY - swift-object-auditor on ms-be2040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:36:21] RECOVERY - swift-object-updater on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:36:31] zeljkof: testing.. 
[11:36:48] (03CR) 10jenkins-bot: Remove unused default source language config for CX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460492 (owner: 10Petar.petkovic) [11:38:22] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [11:38:42] zeljkof: looks fine. [11:38:51] kart_: ok, deploying [11:39:06] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler happy: https://puppet-compiler.wmflabs.org/compiler1002/12686/" [puppet] - 10https://gerrit.wikimedia.org/r/463732 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:41:07] !log zfilipin@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:460492|Remove unused default source language config for CX]] (duration: 00m 57s) [11:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:14] kart_: deployed [11:41:18] cool. [11:41:34] kart_: waiting for 463446 to merge... [11:43:23] Yeah. 
We can go and finish 5K run till it gets merged ;) [11:43:51] yes, about that time :D [11:52:22] !log install prometheus-openstack-exporter 0.0.8-3 in reprepro T203177 [11:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:26] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [11:52:30] (03PS1) 10Arturo Borrero Gonzalez: d/dirs: include /var/cache/prometheus-openstack-exporter [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463736 (https://phabricator.wikimedia.org/T203177) [11:52:33] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: refresh for 0.0.8-3 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463737 [11:52:45] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/dirs: include /var/cache/prometheus-openstack-exporter [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463736 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:53:02] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/changelog: refresh for 0.0.8-3 [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463737 (owner: 10Arturo Borrero Gonzalez) [11:56:51] !log stopping db1093 to clone it to dbstore1001 [11:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:21] zeljkof: merged? [12:02:31] kart_: yes, a few seconds ago [12:02:56] zeljkof: What are these post-merge tests? [12:03:12] kart_: I don't really know :D [12:03:36] ah, coverage report [12:05:23] kart_: the patch is at mwdebug2001 [12:06:19] (03CR) 10Mholloway: [C: 04-1] Fix: Regenerate map tiles up to zoom level 9 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [12:07:01] zeljkof: okay. testing. 
[12:08:30] (03CR) 10Banyek: [C: 032] wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [12:09:38] (03PS3) 10Banyek: wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983) [12:09:44] (03CR) 10Banyek: [V: 032 C: 032] wikireplicas: ensure wmf-pt-kill service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/463734 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [12:10:47] zeljkof: looks OK. Go ahead. [12:11:00] kart_: ok, deploying [12:11:09] zeljkof: needed to confirm, so double check on few articles, so took more time.. [12:11:42] (03PS4) 10Arturo Borrero Gonzalez: cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/463576 (owner: 10BryanDavis) [12:12:25] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/463576 (owner: 10BryanDavis) [12:12:54] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/ContentTranslation/: SWAT: [[gerrit:463446|Fix error in CXTransclusionNode#afterRender method (T205521)]] (duration: 00m 59s) [12:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:59] T205521: Working with references is broken in CX2 - https://phabricator.wikimedia.org/T205521 [12:13:13] kart_: deployed! please check and thanks for deploying with #releng ;) [12:13:18] !log EU SWAT finished [12:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:27] zeljkof: cool. Thanks a lot! 
[12:15:05] !log enabling puppet on labsdb1009, labsdb1010, labsdb1011 (T183983) [12:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:09] T183983: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 [12:19:28] !log stopping replication on s2@dbstore2002: the tables are being compressed [12:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:36] !log stopping replication on s2@dbstore2002: the tables are being compressed (T204930) [12:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:40] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 [12:20:05] 10Operations, 10Recommendation-API, 10Research, 10SCB, 10Services (next): Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10bmansurov) Thanks @fgiunchedi. @mobrovac no blockers left? [12:26:19] converting enwiki.categorylinks to TokuDB on host dbstore1002 (T205544) [12:26:20] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [12:26:28] !log converting enwiki.categorylinks to TokuDB on host dbstore1002 (T205544) [12:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:08] (03CR) 10Gehel: [C: 04-1] Fix: Regenerate map tiles up to zoom level 9 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [12:34:21] (03PS1) 10Elukey: profile::analytics::refinery::job::data_purge: add two timers [puppet] - 10https://gerrit.wikimedia.org/r/463742 (https://phabricator.wikimedia.org/T172532) [12:35:05] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::data_purge: add two timers [puppet] - 10https://gerrit.wikimedia.org/r/463742 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [12:36:11] (03PS2) 10Elukey: yarn_http: Restrict to caches [puppet] 
- 10https://gerrit.wikimedia.org/r/463431 (owner: 10Muehlenhoff) [12:38:23] !log upload hfst_3.13.0~r3461-1+wmf2 to apt.wikimedia.org/jessie-wikimedia/main. T199962 [12:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:28] T199962: Apertium maintenance updates (July-September) - https://phabricator.wikimedia.org/T199962 [12:39:09] (03CR) 10Elukey: [C: 032] yarn_http: Restrict to caches [puppet] - 10https://gerrit.wikimedia.org/r/463431 (owner: 10Muehlenhoff) [12:42:32] (03PS2) 10Elukey: Only allow HTTP port for Hue [puppet] - 10https://gerrit.wikimedia.org/r/463428 (owner: 10Muehlenhoff) [12:43:34] (03CR) 10Elukey: [C: 032] Only allow HTTP port for Hue [puppet] - 10https://gerrit.wikimedia.org/r/463428 (owner: 10Muehlenhoff) [12:44:18] akosiaris: thanks! [12:46:22] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) [12:46:24] 10Operations, 10Wikimedia-Logstash, 10Goal: Logstash/Kibana architecture review - https://phabricator.wikimedia.org/T198754 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi The architecture and gaps review has been carried out as part of the [[ https://docs.google.com/document/d/1Aq-Dhq3SbRCPmQdw6jjaHvH... 
[12:47:11] (03PS3) 10Elukey: Filter out duplicates in allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/463253 (owner: 10Muehlenhoff) [12:47:58] (03CR) 10Elukey: [C: 032] Filter out duplicates in allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/463253 (owner: 10Muehlenhoff) [12:48:20] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) [12:48:26] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi The audit/list of current and future logs produce... [12:49:07] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) [12:49:09] 10Operations, 10Wikimedia-Logstash, 10Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi The architecture and gaps review has been carried out as part of the [[ https://docs.google.com/doc... [12:50:06] (03PS4) 10MSantos: Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) [12:51:25] (03CR) 10MSantos: "Agreed. I was a bit confused on that." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [12:53:22] 10Operations, 10Wikimedia-Logstash: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [12:56:23] 10Operations, 10Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (10fgiunchedi) [12:58:25] 10Operations, 10Wikimedia-Logstash: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) [13:00:24] (03PS1) 10KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/463745 (https://phabricator.wikimedia.org/T199447) [13:00:39] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::vhost: generalize the upload_rewrite mechanism. [puppet] - 10https://gerrit.wikimedia.org/r/463217 [13:02:06] 10Operations, 10Wikimedia-Logstash: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10fgiunchedi) [13:06:04] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10fgiunchedi) [13:06:26] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12688/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463217 (owner: 10Giuseppe Lavagetto) [13:06:58] (03PS1) 10Volans: sre.switchdc.mediawiki: remove HHVM restart [cookbooks] - 10https://gerrit.wikimedia.org/r/463747 (https://phabricator.wikimedia.org/T199079) [13:07:23] 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [13:07:54] 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 
(10fgiunchedi) [13:07:59] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) [13:08:10] (03CR) 10KartikMistry: [C: 04-1] "Only to be deployed with schedule plan and testing in Beta or Labs(?)." [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/463745 (https://phabricator.wikimedia.org/T199447) (owner: 10KartikMistry) [13:09:13] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::vhost: generalize the upload_rewrite mechanism. [puppet] - 10https://gerrit.wikimedia.org/r/463217 (owner: 10Giuseppe Lavagetto) [13:10:06] 10Operations, 10Wikimedia-Logstash: Deprecate >= 50% of udp2log producers - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [13:11:06] 10Operations, 10Wikimedia-Logstash: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [13:11:08] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) [13:11:10] (03PS1) 10Volans: ircecho: log exception on exit [puppet] - 10https://gerrit.wikimedia.org/r/463749 (https://phabricator.wikimedia.org/T205522) [13:11:12] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) [13:12:12] PROBLEM - Memory correctable errors -EDAC- on wtp2011 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [13:12:29] 10Operations, 10Wikimedia-Logstash, 10Goal: Logstash/Kibana architecture review - https://phabricator.wikimedia.org/T198754 (10fgiunchedi) [13:12:52] _joe_: jynus: 
FYI https://gerrit.wikimedia.org/r/c/mediawiki/core/+/463748 I didn't notice this doesn't exist in this maintenance script. I didn't write it, if I did, I would've added it [13:12:55] sorry again [13:14:06] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) [13:15:09] (03PS1) 10Jcrespo: mariadb: Add s6 instance to dbstore1001 for backup generation [puppet] - 10https://gerrit.wikimedia.org/r/463751 (https://phabricator.wikimedia.org/T201392) [13:16:15] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10fgiunchedi) [13:16:19] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) [13:16:24] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (10fgiunchedi) [13:16:34] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10fgiunchedi) [13:16:36] 10Operations, 10Mail, 10Wikimedia-Logstash, 10User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (10fgiunchedi) [13:16:37] (03CR) 10Jcrespo: [C: 032] mariadb: Add s6 instance to dbstore1001 for backup generation [puppet] - 10https://gerrit.wikimedia.org/r/463751 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [13:16:40] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. 
- https://phabricator.wikimedia.org/T198756 (10fgiunchedi) [13:17:32] Amir1: you don't need to say sorry [13:19:32] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3149.90 seconds [13:20:08] (03CR) 10Gehel: [C: 04-1] Add elasticsearch_cluster module (039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:20:40] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10herron) [13:20:43] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10herron) 05Open>03Resolved a:03herron [13:22:53] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) [13:23:31] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This task and its related go... [13:23:31] 10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10faidon) I'm not familair with openbugbounty.org, but looking [[ https://www.openbugbounty.org/bugbounty/create/ | at their website... 
[13:24:02] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:24:30] (03CR) 10Mholloway: [C: 031] Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [13:25:51] RECOVERY - Filesystem available is greater than filesystem size on ms-be1041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [13:28:17] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) [13:28:43] ^ banyek dbstore1002 alerting again [13:30:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:30:45] Probably disabling lag for 24h is a good idea while doing heavy IO operations with that host, not the first time we see that :( [13:31:41] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) One more datapoint. While a memcached icinga error was still ongoing I ran m... [13:32:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12690/mw1261.eqiad.wmnet/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [13:33:24] (03CR) 10Gehel: "Looking good! I'll do a last pass when the parent CR is merged." 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [13:34:42] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:34:56] marostegui: tokudb alters are not online [13:35:11] they actually block replication [13:35:58] jynus: More reasons to downtime then :) [13:36:57] * addshore reads up [13:40:12] PROBLEM - mysqld processes on dbstore1001 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld [13:40:45] ^this is me, but it was supposed to have notifications disabled [13:41:03] I guess icinga weird puppet race condition issues [13:41:22] 10Operations, 10monitoring: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi) [13:42:05] (03PS73) 10Gehel: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [13:42:27] rerunning puppet again and see if icinga setup kicks-in [13:42:35] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10fgiunchedi) [13:42:41] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196 (10fgiunchedi) [13:42:43] 10Operations, 10monitoring: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi) [13:42:45] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10fgiunchedi) [13:43:31] PROBLEM - puppet 
last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:32] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:44:01] memcache errors are still up and stable since 13:20 [13:46:08] mw2204 [13:46:52] up and stable? [13:47:26] I am wrong [13:47:29] it is not a single server [13:47:35] but happens in bursts [13:47:37] seems more intermittent, it seems the usual gadget-definition issue :( [13:47:50] T203786 [13:47:51] I don't see a pattern [13:47:51] T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [13:49:02] elukey: however, the errors seem quite stable: https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen&from=now-24h&to=now [13:49:29] and now they are gone [13:50:18] is there a memcache logs dashboard? [13:50:57] https://logstash.wikimedia.org/app/kibana#/dashboard/memcached [13:51:01] (03CR) 10Gehel: [C: 032] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [13:51:10] elukey: thanks! [13:51:14] yes but if you see the same sustained pattern happened before [13:51:27] I am pretty sure it is this annoying issue with gadget-definition [13:51:58] !log Downtimed the slave lag monitoring on dbstore1002 while the tables getting converted (T205544) [13:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:03] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [13:52:11] banyek: thanks! 
[13:52:38] 10Operations, 10Operations-Software-Development, 10Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) [13:52:47] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:52:55] it was downtimed when i started the conversion it might be expired (marvin-bot iirc) [13:54:02] (03PS1) 10Alexandros Kosiaris: etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760 [13:54:10] 10Operations, 10Operations-Software-Development, 10Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (10Volans) [13:55:18] (03CR) 10Elukey: [C: 031] "didn't spot anything weird from pcc, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [13:56:16] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [13:56:25] (03PS6) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) [13:57:33] (03PS1) 10Gehel: elasticsearch: default instance config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/463761 (https://phabricator.wikimedia.org/T198351) [13:58:48] (03CR) 10Gehel: [C: 032] elasticsearch: default instance config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/463761 (https://phabricator.wikimedia.org/T198351) (owner: 10Gehel) [13:59:10] (03PS2) 10Gehel: elasticsearch: default instance config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/463761 
(https://phabricator.wikimedia.org/T198351) [13:59:58] (03PS1) 10Elukey: Add matomo1001 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/463762 (https://phabricator.wikimedia.org/T202962) [14:00:00] bstorm_: o/ - can you ping me when you are online? [14:00:18] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam) [14:00:37] (03PS2) 10Elukey: Add matomo1001 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/463762 (https://phabricator.wikimedia.org/T202962) [14:00:52] I'm online (in EST right now) [14:00:58] elukey: what's up? [14:01:19] ah nice! I wanted to ask you an opinion about the systemd timers types [14:01:30] Sure [14:01:30] we can do in pvt so I will not spam in here :) [14:01:37] Yup :) [14:04:56] banyek: I am not worried, but just in case, I cannot see downtime or ack or disabling of alerts on dbstore1002 [14:05:00] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [14:05:19] pinging just in case it was an icinga issue (sometimes it happens) [14:06:09] godog: the "Logstash rate of ingestion percent change compared to yesterday" I am going to guess was the small incident we had a few hours ago, which sends lots of errors per second [14:06:38] (03PS1) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763 [14:07:07] jynus: it was an icinga issue as it was too slow when I hit the button for setting downtime, and I navigated away from page before i submitted the downtime (it was my bad tbh.) 
[14:07:40] banyek: I really mean it, sometimes icinga doesn't do downtimes [14:07:44] and requires a restart [14:08:18] (03PS2) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763 [14:09:10] (03CR) 10Vgutierrez: [C: 032] _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [14:09:27] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 161.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [14:10:07] (03PS1) 10Anomie: Set wgMultiContentRevisionSchemaMigrationStage read-new on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308) [14:10:36] (03PS3) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763 [14:10:38] (03CR) 10Elukey: [C: 032] Add matomo1001 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/463762 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [14:10:55] (03CR) 10jerkins-bot: [V: 04-1] Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk) [14:11:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) Hi @chelsyx, could you please clarify what is meant by `stats` role account? I don't see a group... 
[14:11:17] (03CR) 10Anomie: [C: 032] "Deploy config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie) [14:12:32] (03Merged) 10jenkins-bot: Set wgMultiContentRevisionSchemaMigrationStage read-new on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie) [14:12:56] (03PS7) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 [14:13:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [14:13:31] (03CR) 10Alex Monk: [C: 032] "rebased already-approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [14:14:00] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting MCR migration stage to write-both/read-new on mediawikiwiki (T198308) (duration: 00m 56s) [14:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:06] T198308: Enable MCR migration stage "write both, read new" on live systems - https://phabricator.wikimedia.org/T198308 [14:15:18] (03Merged) 10jenkins-bot: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [14:15:43] (03PS4) 10Alex Monk: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 [14:15:50] (03CR) 10Alex Monk: [C: 032] "rebase already-approved" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk) [14:16:10] (03CR) 10Alex Monk: [C: 04-1] "needs tests and rebase" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex 
Monk) [14:16:37] PROBLEM - Check systemd state on matomo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:16:40] jynus: yes that's likely it [14:16:49] (03PS1) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768 [14:16:58] (03CR) 10jenkins-bot: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 (owner: 10Alex Monk) [14:17:01] (03PS2) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768 [14:17:05] (03PS4) 10ArielGlenn: move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763 [14:17:15] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam) [14:17:45] matomo1001 is failing due to me, first puppet run, apache + php [14:17:47] :P [14:17:55] (03CR) 10ArielGlenn: [C: 032] move 'multiversion' setting for addschanges dumps to 'wiki' section [puppet] - 10https://gerrit.wikimedia.org/r/463763 (owner: 10ArielGlenn) [14:18:13] (03Merged) 10jenkins-bot: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk) [14:18:44] any idea what the difference in specs between mwmaint1001 and mwmaint2001 are? [14:18:47] RECOVERY - Check systemd state on matomo1001 is OK: OK - running: The system is fully operational [14:19:25] (03CR) 10Andrew Bogott: [C: 031] "I double-checked what got installed on labweb1001 (which is Stretch) and it looks like Newton." 
[puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [14:19:43] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10herron) [14:19:51] (03CR) 10jenkins-bot: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk) [14:20:07] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[matomo] [14:21:40] addshore: CPU wise ? 24 vs 40 cpus if that is of any help. memory wise they are the same. https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=mwmaint2001&var-datasource=codfw%20prometheus%2Fops vs https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=mwmaint1001&var-datasource=eqiad%20prometheus%2Fops [14:21:52] thanks! [14:22:08] trying to figure out why wikidata dispatching got 5 / 10 or something times faster with the DC switchover [14:22:14] related ticket is onId > 0 && $entityRevision = [14:22:23] .. no its not... 
it is https://phabricator.wikimedia.org/T205865 [14:23:47] (03CR) 10jenkins-bot: Set wgMultiContentRevisionSchemaMigrationStage read-new on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463764 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie) [14:23:49] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Miriam) [14:23:52] hmm, interesting [14:25:04] (03CR) 10Gehel: [C: 04-1] Add elasticsearch_cluster module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [14:25:16] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:25:26] 10Operations, 10monitoring: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [14:26:42] (03CR) 10Alex Monk: [C: 04-1] "per discussion on IRC, should document that while in this state the certificate will have the wrong domains, but it a) keeps the existing " [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [14:29:19] (03CR) 10Mathew.onipe: Add elasticsearch_cluster module (039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [14:30:50] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Isaac) [14:31:57] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Isaac) [14:33:53] 10Operations, 10Traffic: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema) Apache Traffic Server v8.0.0 [[http://trafficserver.apache.org/downloads#8.0.0 | was released ]] on September 25th, 2018. 
Debian packaging work [[https://salsa.debian.org/debian/trafficserver/merge_request... [14:34:56] (03PS2) 10Alexandros Kosiaris: etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760 [14:35:43] (03PS3) 10Alexandros Kosiaris: etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760 [14:37:10] 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) [14:37:22] 10Operations, 10ops-codfw, 10DBA: db2058: Disk #1 predictive failure - https://phabricator.wikimedia.org/T205872 (10Marostegui) p:05Triage>03Normal [14:37:34] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2058 is CRITICAL: cluster=mysql device=cciss,0 instance=db2058:9100 job=node site=codfw Marostegui T205872 - The acknowledgement expires at: 2018-10-04 14:34:14. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2058&var-datasource=codfw%2520prometheus%252Fops [14:37:42] 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10fgiunchedi) [14:38:58] (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: Remove the diamond collector and monitrc files [puppet] - 10https://gerrit.wikimedia.org/r/463760 (owner: 10Alexandros Kosiaris) [14:39:33] 10Operations, 10Wikimedia-Logstash: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) [14:40:32] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) [14:42:48] (03CR) 10Filippo Giunchedi: [C: 032] ircecho: log exception on exit [puppet] - 10https://gerrit.wikimedia.org/r/463749 (https://phabricator.wikimedia.org/T205522) (owner: 10Volans) [14:43:01] thanks godog! 
:) [14:43:12] (03PS4) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) [14:43:20] PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:43:20] PROBLEM - swift-object-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:43:20] PROBLEM - swift-object-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:43:22] volans: np! thanks to you for taking the time [14:43:30] PROBLEM - swift-object-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:43:31] ms-be1041 is me, silence expired [14:43:37] fixing [14:43:40] PROBLEM - swift-object-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:45:19] (03CR) 10Bstorm: [C: 032] openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [14:48:49] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463776 [14:49:14] Amir1: apparently 1617150 passwords got prefixed on commons which is a bit scary [14:49:33] we do know from other wikis that it does not affect real users, right? [14:51:06] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) 05Open>03Resolved I checked again the server logs, everything looks good. Resolving this task. [14:53:42] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:56:27] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10herron) [14:56:30] 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10herron) [14:58:16] (03PS1) 10Bstorm: Revert "openstack: add case for stretch and newton in client repos" [puppet] - 10https://gerrit.wikimedia.org/r/463779 [14:58:32] /away [14:59:53] (03PS3) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768 [15:00:09] !log upgrade etherpad to 1.7.0-2 [15:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:21] (03CR) 10Bstorm: [C: 032] Revert "openstack: add case for stretch and newton in client repos" [puppet] - 10https://gerrit.wikimedia.org/r/463779 (owner: 10Bstorm) [15:00:44] (03PS4) 10Elukey: Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768 [15:01:27] (03CR) 10Elukey: [C: 032] Revert "profile::analytics::refinery::job::data_purge: add two timers" [puppet] - 10https://gerrit.wikimedia.org/r/463768 (owner: 10Elukey) [15:03:29] Nikerabbit: are you about? 
[15:03:57] (03CR) 10Anomie: wiki replicas: Remove most comment joins from non-compat tables (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [15:04:18] (03PS38) 10Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [15:04:20] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2046 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:04:20] (03PS9) 10Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) [15:05:11] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:05:32] (03CR) 10jerkins-bot: [V: 04-1] Add elasticsearch_cluster module [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [15:05:34] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: adding new features [software/spicerack] - 10https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [15:06:02] chasemp: yeah [15:06:35] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) [15:09:29] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:10:45] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) [15:12:21] (03PS1) 10Elukey: profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463781 (https://phabricator.wikimedia.org/T202962) [15:15:28] (03PS1) 
10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) [15:16:04] !log stopping db1064 to clone it to dbstore1001 [15:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:30] (03Abandoned) 10Elukey: profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463781 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [15:19:00] tgr: yup, I already did it on Persian Wikipedia, rather medium wiki [15:19:11] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:19:20] and gave them a heads up to contact me if anything goes wrong [15:19:31] and today deleted those [15:19:41] (03PS1) 10Elukey: profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463783 (https://phabricator.wikimedia.org/T202962) [15:19:48] it's already deleted on mediawikiwiki [15:20:27] (03PS2) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) [15:21:54] (03CR) 10Elukey: [C: 032] profile::piwik: move to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/463783 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [15:22:49] (03PS1) 10Jcrespo: mariadb: Add x1 to dbstore1001 as a backup source [puppet] - 10https://gerrit.wikimedia.org/r/463785 (https://phabricator.wikimedia.org/T201392) [15:23:53] (03CR) 10Jcrespo: [C: 04-2] "Not until all servers are back in a good state (inc. 
repl lag)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463776 (owner: 10Jcrespo) [15:27:19] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12696/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:28:00] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[mariadb] [15:28:05] this is me again --^ [15:28:08] 10Operations, 10MediaWiki-Maintenance-scripts, 10Core Platform Team Kanban (Watching / External): sql enwik gives a poor error message when db doesn't exist - https://phabricator.wikimedia.org/T199008 (10CCicalese_WMF) [15:28:20] PROBLEM - Check systemd state on matomo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:28:47] 10Operations, 10monitoring, 10Core Platform Team Kanban (Watching / External), 10Wikimedia-Incident: Add alerts for Logstash rates in production - https://phabricator.wikimedia.org/T199479 (10CCicalese_WMF) [15:29:34] 10Operations, 10WMF-JobQueue, 10Core Platform Team Kanban (Watching / External), 10User-ArielGlenn: Use PHP7 for web requests on jobrunner servers - https://phabricator.wikimedia.org/T195392 (10CCicalese_WMF) [15:30:35] (03PS2) 10Jcrespo: mariadb: Add x1 to dbstore1001 as a backup source [puppet] - 10https://gerrit.wikimedia.org/r/463785 (https://phabricator.wikimedia.org/T201392) [15:30:37] (03PS3) 10Jcrespo: mariadb: Setup db1116 for backup generation on eqiad of s7 and s8 [puppet] - 10https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) [15:30:39] (03PS1) 10Jcrespo: mariadb: Setup dbstore1001 as the backup source of s6, x1 [puppet] - 10https://gerrit.wikimedia.org/r/463788 (https://phabricator.wikimedia.org/T201392) [15:35:41] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is
currently enabled, last run 3 minutes ago with 0 failures [15:39:18] (03PS5) 10Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 [15:42:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:42:49] (03CR) 10Vgutierrez: api: Also handle SIGHUP signals to the API process (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk) [15:46:20] RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational [15:46:29] RECOVERY - swift-object-server on ms-be1041 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:46:29] RECOVERY - swift-object-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:46:29] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:46:30] RECOVERY - swift-object-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:46:49] RECOVERY - swift-object-auditor on ms-be1041 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:47:59] RECOVERY - Check systemd state on matomo1001 is OK: OK - running: The system is fully operational [15:53:28] (03CR) 10Jcrespo: [C: 032] mariadb: Add x1 to dbstore1001 as a backup source [puppet] - 10https://gerrit.wikimedia.org/r/463785 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [15:55:37] (03CR) 10Vgutierrez: api: Also handle SIGHUP signals to the API process (031 comment) [software/certcentral] - 
10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk) [15:57:23] 10Operations, 10Operations-Software-Development, 10Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) p:05Triage>03Normal [15:59:12] (03PS1) 10Andrew Bogott: Nova: reduce number of worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) [16:01:19] (03PS2) 10Andrew Bogott: Nova: reduce number of worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) [16:01:22] (03CR) 10Bstorm: "That should cut back nicely. It makes me really want to know exactly what "too few" looks like. I guess if api calls start getting hung " [puppet] - 10https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) (owner: 10Andrew Bogott) [16:02:40] (03CR) 10Bstorm: [C: 031] Nova: reduce number of worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) (owner: 10Andrew Bogott) [16:04:07] (03CR) 10Andrew Bogott: [C: 032] Nova: reduce number of worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/463789 (https://phabricator.wikimedia.org/T202889) (owner: 10Andrew Bogott) [16:05:44] (03CR) 10Andrew Bogott: [C: 031] "This is messy but it's my mess" [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [16:07:29] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: allow queries to the nova API by novaobserver [puppet] - 10https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) [16:09:26] arturo, admin or novaobserver? [16:09:29] not sure this makes sense [16:09:37] should just be opened to all if you're allowing novaobserver in [16:11:10] Krenair: well, novaobserver is an authenticated user. I think that's different that having the API fully public? 
[16:11:19] no [16:11:23] it's a guest user [16:11:28] the password is public knowledge [16:11:58] the API doesn't allow unauthenticated requests AFAIK so we added that as a read-only user [16:13:41] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:15:24] Yeah, I think that "" is fine policy for those things unless there's some UI reason why we don't want the values displaying in Horizon [16:16:12] Although to be honest I don't understand what those policies (without verbs) really control [16:17:22] Are those definitely policies that just control 'list' or do they provide access to a whole suite of commands? [16:17:55] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10Dzahn) [16:18:31] 10Operations, 10fundraising-tech-ops, 10netops: deploy PFW policy commit 99eb6f026 - https://phabricator.wikimedia.org/T205888 (10Jgreen) p:05Triage>03Normal [16:19:11] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10Aklapper) Maybe this is about https://phabricator.wikimedia.org/S4 ? Needs clarification from @Mathew.onipe. [16:21:43] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10Gehel) Yes, I can confirm, this is about S4. @Mathew.onipe will work in particular on replacing our elasticsearch servers, so he should be able to follow when those will be or... 
[16:23:24] 10Operations, 10Core Platform Team Kanban, 10Epic, 10Performance-Team (Radar), 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213 (10CCicalese_WMF) [16:23:34] 10Operations, 10Core Platform Team Kanban, 10Epic, 10Performance-Team (Radar), 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206 (10CCicalese_WMF) [16:24:12] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: unprotect some nova API queries [puppet] - 10https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) [16:26:33] !log ppchelko@deploy1001 Started restart [cpjobqueue/deploy@58f9ed3]: Fix KafkaConsumer not connected error [16:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:21] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [16:31:49] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [16:35:45] (03PS1) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [16:46:05] (03PS2) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [16:46:34] 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10CCicalese_WMF) [16:46:58] (03CR) 10Paladox: "I tested locally and works in python3 :)" [puppet] - 10https://gerrit.wikimedia.org/r/463794 (owner: 10Paladox) [16:47:51] godog volans i managed to convert ircecho to python3 here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463794/ (tested locally and works) [16:47:55] it may fix 
https://phabricator.wikimedia.org/T205522 [16:50:48] paladox: thanks! looks simple enough, did you have a chance to test it with garbage data like the problem in T205522 ? [16:50:49] T205522: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 [16:50:59] it feels like we're putting lipstick on a pig tho [16:51:09] godog haven't tested it with garbage data. [16:51:29] 10Operations, 10PoolCounter, 10monitoring, 10Core Platform Team Kanban (Watching / External), 10Wikimedia-Incident: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (10CCicalese_WMF) [16:51:31] lol [16:52:53] paladox: ack, thanks! [16:55:49] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: ensure permissions [puppet] - 10https://gerrit.wikimedia.org/r/463795 (https://phabricator.wikimedia.org/T203177) [16:56:06] godog garbage stuff like https://wm-bot.wmflabs.org/logs/%23wikidata/20180926.txt ? [16:57:00] paladox: perhaps, anything that triggers UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 92: invalid start byte [16:57:06] ok [16:57:10] see also https://phabricator.wikimedia.org/T205522#4619588 [16:57:13] gotta go [16:58:18] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463776 (owner: 10Jcrespo) [16:58:34] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: ensure permissions [puppet] - 10https://gerrit.wikimedia.org/r/463795 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [16:59:23] 10Operations, 10Operations-Software-Development, 10Goal: Expand Spicerack library and SRE Cookbooks - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205867 (10Volans) a:03Volans [16:59:30] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463776 (owner: 10Jcrespo) [17:00:04] 
gehel: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1700). [17:00:18] jouncebot: o/ [17:00:34] onimisionipe: ^^ let's do this together [17:00:51] gehel: alright!. bring it on! [17:01:30] 10Operations, 10Puppet, 10User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) The change is out, the debian package works as expected, but the template generation had a typo which was fixed in https://gerrit.wikimedia.org/r/#/c/operati... [17:01:51] 10Operations: Netbox: upgrade to the latest version (>= 2.4) - https://phabricator.wikimedia.org/T205896 (10Volans) p:05Triage>03Normal [17:02:06] 10Operations, 10Operations-Software-Development, 10Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (10Volans) p:05Triage>03Normal a:03Volans [17:02:40] !log stopping some mariadb instances on dbstore1001 and starting compression T201392 [17:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:44] T201392: Finish eqiad metadata database backup setup (s1-s8, x1) - https://phabricator.wikimedia.org/T201392 [17:03:02] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Jdrewniak) [17:03:04] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10Volans) p:05Triage>03Normal [17:03:49] 10Operations: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10Volans) p:05Triage>03Normal [17:05:37] 10Operations, 10Operations-Software-Development: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10Volans) p:05Triage>03Normal [17:05:49] 
10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10Volans) a:05Volans>03None [17:05:59] 10Operations: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10Volans) a:05Volans>03None [17:06:29] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1093, db1064 (duration: 00m 57s) [17:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:52] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1093, db1064 to create backup instances" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463776 (owner: 10Jcrespo) [17:09:10] 10Operations, 10Operations-Software-Development: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900 (10Volans) p:05Triage>03Normal [17:09:22] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 59.22 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:11:32] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 72.87 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:11:57] (03PS1) 10Elukey: profile::piwik::webserver: install php7.0-mbstring on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/463797 (https://phabricator.wikimedia.org/T202962) [17:13:07] (03CR) 10Elukey: [C: 032] profile::piwik::webserver: install php7.0-mbstring on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/463797 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [17:17:20] (03PS6) 10Jdlrobson: Introduce cumin::selector dummy class [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [17:17:42] (03PS3) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/463794 [17:26:31] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin 
on einsteinium is CRITICAL: 59.95 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:27:07] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@a637583]: Test deployment for recent updater build and GUI changes. Also blazegraph updates(wdqs1009) [17:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:29] 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Ban spam arriving to my tools email - https://phabricator.wikimedia.org/T202558 (10GTirloni) The amount of spam seems to have reduced significantly in the last couple of days. Emails blocked: ``` 2457 Blocked by DNSBL... [17:28:32] (03PS1) 10Jdlrobson: Enable Page issues A/B test set rate to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792) [17:28:34] (03PS1) 10Jdlrobson: Minerv page issues A/B test to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) [17:28:53] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@a637583]: Test deployment for recent updater build and GUI changes. Also blazegraph updates(wdqs1009) (duration: 01m 46s) [17:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.18 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:31:04] (03PS1) 10Jdlrobson: Page issues A/B test to 20% of users (Start the a/b test!) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463807 (https://phabricator.wikimedia.org/T200792) [17:39:36] (03PS1) 10Elukey: matomo: replace last piwik occurrences [puppet] - 10https://gerrit.wikimedia.org/r/463810 (https://phabricator.wikimedia.org/T202962) [17:40:08] (03CR) 10Elukey: [C: 032] matomo: replace last piwik occurrences [puppet] - 10https://gerrit.wikimedia.org/r/463810 (https://phabricator.wikimedia.org/T202962) (owner: 10Elukey) [17:57:24] !log push fw change on pfw3-codfw - T205888 [17:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:28] T205888: deploy PFW policy commit 99eb6f026 - https://phabricator.wikimedia.org/T205888 [17:59:26] !log push fw change on pfw3-eqiad - T205888 [17:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T1800) [18:00:04] stephanebisson, kostajh, Amir1, and Jdlrobson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:16] o/ mine is not testable [18:00:22] on maintenance script [18:00:30] I'll do the SWAT today [18:01:30] jdlrobson: You around? [18:02:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.91 seconds [18:04:20] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10Dzahn) The current "custom policy" on S4 says this: - allow administrators - allow members of any project: acl*sre-team, acl*procurement-review - allow users: @RobH I assum... 
[18:05:56] \o [18:06:10] RoanKattouw: i need to do 1 of those swats earlier in the window and the other later [18:06:18] Yup I saw [18:06:20] to give me some time to check the grafana graphs in between [18:06:21] I'll do the first one right now [18:06:23] cool! thank you :) [18:06:30] (03CR) 10Catrope: [C: 032] Enable Page issues A/B test set rate to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [18:07:22] RoanKattouw: I'm here [18:08:04] (03Merged) 10jenkins-bot: Enable Page issues A/B test set rate to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [18:09:32] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 88.87 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [18:10:13] jdlrobson: First patch is on mwdebug2001, not sure if there's any testing you can do there, but if there is please do it [18:10:26] RoanKattouw: will do won't take long [18:11:10] ok we're good to go! please merge RoanKattouw :) [18:11:25] kostajh: Yours is on mwdebug2001 (AKA mw2017 in the tool) [18:11:37] RoanKattouw: thanks, will test now [18:12:24] kostajh: While you're at it, could you test https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/463680 too? 
[18:12:53] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable page issues A/B test at 5% rate (T200792) (duration: 00m 59s) [18:12:55] RoanKattouw: sure [18:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:58] T200792: Run A/B test on page issues (Farsi, Japanese, Russian, English) - https://phabricator.wikimedia.org/T200792 [18:15:34] RoanKattouw: both look good to me [18:16:12] (03CR) 10jenkins-bot: Enable Page issues A/B test set rate to 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463805 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [18:17:48] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/PageTriage/: Ensure valid AFC option is selected (T205324, T205168); hide copyvio behind a global var and URL param (duration: 00m 57s) [18:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:54] T205168: AfC: clicking 'Set filters' link will automatically change the sorting option to 'Created date (newest)' - https://phabricator.wikimedia.org/T205168 [18:17:55] T205324: [wmf.22] AfC - Sort option 'Created date(newest)' displayed for default/reload - https://phabricator.wikimedia.org/T205324 [18:18:00] RoanKattouw: will be back from the graphs in 20 mins :) [18:21:06] RoanKattouw: just to confirm, is my patch deployed? [18:22:23] Not yet, it was only just merged [18:23:02] okay, noted [18:23:05] Deploying now [18:23:52] Thanks. Once you're done with SWAT please take a look at https://phabricator.wikimedia.org/T205904, we might SWAT it if there will be time (Maybe evening SWAT) [18:23:56] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.23/maintenance/includes/DeleteLocalPasswords.php: T201009 (duration: 00m 56s) [18:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:01] T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) 
after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009 [18:27:47] (03PS2) 10Jdlrobson: Minerv page issues A/B test to 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) [18:28:27] !log ladsgroup@mwmaint2001:~$ mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --wiki=enwiki --prefix (T201009) [18:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:32] I'm monitoring s1 [18:29:47] (03CR) 10Ppchelko: [C: 031] "A related patch for change-prop https://github.com/wikimedia/change-propagation/pull/292" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [18:30:34] RoanKattouw: ready when you are for the other patch [18:30:44] it's looking good [18:30:59] (03CR) 10Catrope: [C: 032] Minerv page issues A/B test to 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [18:32:02] (03Merged) 10jenkins-bot: Minerv page issues A/B test to 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [18:33:40] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable page issues A/B test at 20% rate (T200792) (duration: 00m 56s) [18:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:45] T200792: Run A/B test on page issues (Farsi, Japanese, Russian, English) - https://phabricator.wikimedia.org/T200792 [18:33:49] jdlrobson: ---^^ [18:33:51] And that's the SWAT done [18:36:02] thanks RoanKattouw :) [18:36:20] waiting for the graphs to quadruple.. 
[18:36:35] (03PS1) 10Volans: Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) [18:37:16] (03CR) 10jerkins-bot: [V: 04-1] Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:38:09] (03PS2) 10Volans: Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) [18:39:03] (03CR) 10jenkins-bot: Minerv page issues A/B test to 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463806 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [18:39:37] (03PS3) 10Volans: Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) [18:40:25] (03CR) 10jerkins-bot: [V: 04-1] Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:41:11] sorry for the spam above, I'm blind today [18:41:14] (03PS4) 10Volans: Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) [18:44:44] (03CR) 10Volans: "Compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:47:50] (03CR) 10Dzahn: [C: 031] "looks good, since we are not using the $netbox_media variable in the bacula::director class though, let's add a warning comment that if th" [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:49:08] (03PS5) 10Volans: Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) [18:49:19] (03CR) 10Volans: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) 
(owner: 10Volans) [18:49:40] (03CR) 10Dzahn: [C: 031] Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:50:25] 10Operations, 10SRE-Access-Requests: Add Mathew.onipe(onimisionipe) to procurement group - https://phabricator.wikimedia.org/T205882 (10herron) p:05Triage>03Normal [18:53:06] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10thcipriani) [18:54:53] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10Pchelolo) [18:55:17] 10Operations, 10DBA, 10JADE, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @Marostegui I'm not sure if this helps, but I'll try to better illustrate my question us... [18:58:30] (03CR) 10Ayounsi: [C: 032] Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:58:39] (03CR) 10Ayounsi: [C: 031] Netbox: set media directory [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:59:11] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10Pchelolo) There is a base set of npm packages that are used by all services. Currently, server.js installs heap... 
[18:59:55] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10Pchelolo) [19:11:42] !log restarting ci jenkins for new plugins [19:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:11] <_joe_> jouncebot: next [19:19:11] In 0 hour(s) and 40 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2000) [19:20:42] (03PS1) 10Herron: admin: create new group wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/463835 (https://phabricator.wikimedia.org/T205543) [19:20:44] (03PS1) 10Herron: admin: add onimisionipe to group wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/463836 (https://phabricator.wikimedia.org/T205543) [19:20:46] (03PS1) 10Herron: wdqs: add wdqs-roots group to wdqs common role [puppet] - 10https://gerrit.wikimedia.org/r/463837 (https://phabricator.wikimedia.org/T205543) [19:21:53] (03PS2) 10Herron: admin: create new group wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/463835 (https://phabricator.wikimedia.org/T205543) [19:23:58] !log ppchelko@deploy1001 Started deploy [restbase/deploy@7caf4d8]: Content-negotiation filter going live T128040 [19:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:03] T128040: Document and implement the REST API format versioning and negotiation policy - https://phabricator.wikimedia.org/T128040 [19:27:36] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@7caf4d8]: Content-negotiation filter going live T128040 (duration: 03m 38s) [19:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:44] (03CR) 10Andrew Bogott: [C: 04-1] "'os_compute_api:os-hypervisors' is not a safe thing to open to the public." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [19:50:48] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10thcipriani) [19:51:40] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10thcipriani) p:05Triage>03Normal [19:51:43] !log gehel@deploy1001 Started deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph (wdqs1009 only) [19:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:13] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph (wdqs1009 only) (duration: 00m 30s) [19:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:34] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban), 10Services (watching): Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10thcipriani) [19:53:28] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10thcipriani) [19:53:32] 10Operations, 10Release Pipeline, 10Epic, 10Release-Engineering-Team (Kanban), 10Services (watching): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10thcipriani) [19:54:53] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - 
https://phabricator.wikimedia.org/T205919 (10thcipriani) [19:55:00] 10Operations, 10Citoid, 10Services, 10Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10thcipriani) [19:58:43] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Services: TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10thcipriani) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2000). [20:00:14] (03PS1) 10Cwhite: icinga: move init script customizations to default [puppet] - 10https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) [20:00:19] in ~15-20 mins [20:00:35] parsoid deploy, i mean. :) [20:04:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:13:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:16:51] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 55.95 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:21:25] !log gehel@deploy1001 Started deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph [20:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:22] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 54.2 le 60 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:32:01] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 70.36 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:35:25] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@a637583]: New version of WDQS GUI, updater and blazegraph (duration: 14m 00s) [20:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:22] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 53.51 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:37:36] !log arlolra@deploy1001 Started deploy [parsoid/deploy@8ff45db]: Updating Parsoid to 224ecde [20:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:13] godog: hey, when you have some time. It would be great if we can talk about logstash. I'm doing this: https://phabricator.wikimedia.org/T181630 [20:39:28] akosiaris: ^ [20:39:43] It's really hard to test it [20:39:58] (03CR) 10Dzahn: "nice! yea, that's a nice way to keep the customizations. and a long-standing FIXME. 
some minor inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:44:07] (03PS2) 10Cwhite: icinga: move init script customizations to default [puppet] - 10https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) [20:45:02] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.83 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:45:58] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@8ff45db]: Updating Parsoid to 224ecde (duration: 08m 22s) [20:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:32] 10Operations, 10Continuous-Integration-Infrastructure, 10Goal, 10Release-Engineering-Team (Watching / External), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10dduvall) [20:55:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Goal, 10Release-Engineering-Team (Kanban), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10dduvall) a:03dduvall [20:56:47] <_joe_> arlolra, subbu please let me know if you think I should rework that patch, and in what direction; If it's not ready by tomorrow, I will revert parsoid to use HTTP for now; but I'd really like to get some version of it out before the end of the week, so that we can make the switchover more seamless [20:57:20] _joe_, we'll get it deployed tomorrow. [20:57:36] we just ran out of time for today since we were doing a sensitive deploy. 
[20:57:46] <_joe_> yeah, don't worry :) [20:59:38] 10Operations, 10Datacenter-Switchover-2018, 10Discovery-Search (Current work): Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10EBernhardson) [21:00:04] bawolff and Reedy: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181001T2100). [21:02:37] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) a:03ayounsi [21:02:54] 10Operations: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10ayounsi) a:03ayounsi [21:05:33] (03CR) 10Dzahn: icinga: move init script customizations to default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [21:06:27] (03CR) 10Dzahn: [C: 031] icinga: move init script customizations to default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [21:07:10] 10Operations, 10Continuous-Integration-Infrastructure, 10Goal, 10Release-Engineering-Team (Kanban), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10dduvall) As of this morning both Jenkins master have the Prometheus plugin installed and enabled. The... [21:08:13] (03CR) 10Dzahn: [C: 031] "i looked at the diff between the init script from the stretch distro package and the current custom file in the puppet repo. 
these are the" [puppet] - 10https://gerrit.wikimedia.org/r/463843 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [21:17:06] !log Updated Parsoid to 224ecde (T198504, T133673, T202666) [21:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:14] T198504: [BUG] Audio file shows incomplete toolbar in an article page - https://phabricator.wikimedia.org/T198504 [21:17:15] T202666: html2wt endpoint should handle mismatching content versions - https://phabricator.wikimedia.org/T202666 [21:17:16] T133673: Add width/height attributes to the