[01:56:32] PROBLEM - confd service on cp4013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:57:02] PROBLEM - puppet last run on cp4013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:57:22] PROBLEM - traffic-pool service on cp4013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:59:32] RECOVERY - confd service on cp4013 is OK: OK - confd is active [01:59:52] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures [02:00:22] RECOVERY - traffic-pool service on cp4013 is OK: OK - traffic-pool is active [02:21:39] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.21) (duration: 08m 25s) [02:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:37] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 8 02:27:37 UTC 2017 (duration 5m 58s) [02:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:22] PROBLEM - Varnish traffic logger - varnishmedia on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:21:22] PROBLEM - Confd vcl based reload on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:21:22] PROBLEM - Disk space on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:21:42] PROBLEM - SSH on cp4007 is CRITICAL: Server answer [03:22:22] RECOVERY - Varnish traffic logger - varnishmedia on cp4007 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [03:22:23] RECOVERY - Confd vcl based reload on cp4007 is OK: reload-vcl successfully ran 3h, 53 minutes ago. [03:22:23] RECOVERY - Disk space on cp4007 is OK: DISK OK [03:22:42] RECOVERY - SSH on cp4007 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [03:35:32] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 20 failures. Last run 2 minutes ago with 20 failures. 
Failed resources (up to 3 shown): Service[varnishmedia],Service[rsyslog],Package[command-not-found-data],Package[os-prober] [04:03:32] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [05:00:05] Amir1: Dear anthropoid, the time has come. Please deploy Cleaning ores_classification table (phab:T159753) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T0500). [05:00:05] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [05:01:26] on it [05:09:56] !log start of cleaning up ores_classification rows for two hours (T159753) [05:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:06] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [05:19:32] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 2 minutes ago with 14 failures. Failed resources (up to 3 shown): Service[salt-minion],Service[varnishmedia],Service[rsyslog],Package[command-not-found-data] [05:44:32] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 14 failures. Last run 27 minutes ago with 14 failures. Failed resources (up to 3 shown): Service[salt-minion],Service[varnishmedia],Service[rsyslog],Package[command-not-found-data] [06:02:39] 06Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3242585 (10Marostegui) @jcrespo if you want to test with s4 feel free to pick any host from: T161088 (only db1081 was done) [06:05:25] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3242587 (10Marostegui) Thanks @Cmjohnson for all the help. We can test with db1070 for as much as you like during the week (I just don't like leaving hosts down for the weekend) just let me... 
[06:06:41] (03CR) 10Tim Starling: "Still the same objection from me. I don't see why different wikis need different thumbnail sizes. If it's better to be 250px, then change " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/31580 (https://phabricator.wikimedia.org/T43712) (owner: 10Dereckson) [06:11:06] (03PS1) 10Marostegui: db-codfw.php: Repool db2045, depool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352543 (https://phabricator.wikimedia.org/T162539) [06:11:22] PROBLEM - Check systemd state on cp4007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:13:05] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2045, depool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352543 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [06:14:07] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2045, depool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352543 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [06:14:23] PROBLEM - confd service on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:14:23] PROBLEM - traffic-pool service on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:14:23] PROBLEM - DPKG on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:14:42] PROBLEM - SSH on cp4007 is CRITICAL: Server answer [06:17:22] PROBLEM - Disk space on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:22] PROBLEM - Confd vcl based reload on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:22] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:22] PROBLEM - Varnish traffic logger - varnishmedia on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[06:17:23] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:23] PROBLEM - salt-minion processes on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:23] PROBLEM - Freshness of zerofetch successful run file on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:24] PROBLEM - Varnish HTCP daemon on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:24] PROBLEM - Varnish traffic logger - varnishxcps on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:25] PROBLEM - Webrequests Varnishkafka log producer on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:26] PROBLEM - Varnish traffic logger - varnishxcache on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:26] PROBLEM - Freshness of OCSP Stapling files on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:27] PROBLEM - Varnish traffic logger - varnishreqstats on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:17:27] PROBLEM - Varnish traffic logger - varnishstatsd on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[06:17:28] 06Operations, 10ops-ulsfo: Degraded RAID on cp4007 - https://phabricator.wikimedia.org/T164701#3242594 (10ops-monitoring-bot) [06:18:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2045, depool db2038 - T162539 T163548 (duration: 00m 40s) [06:18:24] !Deploy alter table on wikidatawiki.wb_terms - db2038 - T162539 T163548 [06:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:28] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [06:18:28] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [06:21:22] PROBLEM - IPsec on cp4007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:22:11] (03CR) 10Marostegui: [C: 032] mariadb: Get ready to decomission db1024 [puppet] - 10https://gerrit.wikimedia.org/r/352093 (https://phabricator.wikimedia.org/T162699) (owner: 10Marostegui) [06:22:16] (03PS2) 10Marostegui: mariadb: Get ready to decomission db1024 [puppet] - 10https://gerrit.wikimedia.org/r/352093 (https://phabricator.wikimedia.org/T162699) [06:22:22] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp4007 is OK: No errors detected [06:22:22] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp4007 is OK: No errors detected [06:22:22] RECOVERY - Confd vcl based reload on cp4007 is OK: reload-vcl successfully ran 2h, 41 minutes ago. 
[06:22:22] RECOVERY - Disk space on cp4007 is OK: DISK OK [06:22:22] RECOVERY - Varnish traffic logger - varnishmedia on cp4007 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [06:22:23] RECOVERY - Varnish traffic logger - varnishxcps on cp4007 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcps, UID = 0 (root) [06:22:23] RECOVERY - Freshness of zerofetch successful run file on cp4007 is OK: OK [06:22:24] RECOVERY - Varnish traffic logger - varnishreqstats on cp4007 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishreqstats, UID = 0 (root) [06:22:25] RECOVERY - dhclient process on cp4007 is OK: PROCS OK: 0 processes with command name dhclient [06:22:25] RECOVERY - Webrequests Varnishkafka log producer on cp4007 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [06:22:26] RECOVERY - Varnish HTCP daemon on cp4007 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd [06:22:26] RECOVERY - configured eth on cp4007 is OK: OK - interfaces up [06:22:27] RECOVERY - Freshness of OCSP Stapling files on cp4007 is OK: OK [06:22:27] RECOVERY - MD RAID on cp4007 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [06:22:42] (03CR) 10jenkins-bot: db-codfw.php: Repool db2045, depool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352543 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [06:22:42] RECOVERY - SSH on cp4007 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [06:26:44] I can see a OOM killer entry logged for varnish in the dmesg (cache upload) [06:26:44] (03PS2) 10Marostegui: db-codfw,db-eqiad.php: Decommission db1024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352087 (https://phabricator.wikimedia.org/T162699) [06:28:11] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Decommission db1024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352087 (https://phabricator.wikimedia.org/T162699) (owner: 
10Marostegui) [06:29:11] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352087 (https://phabricator.wikimedia.org/T162699) (owner: 10Marostegui) [06:30:15] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Decommission db1024 - T162699 (duration: 00m 39s) [06:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:24] T162699: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699 [06:30:39] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Decommission db1024 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352087 (https://phabricator.wikimedia.org/T162699) (owner: 10Marostegui) [06:31:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Decommission db1024 - T162699 (duration: 00m 39s) [06:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:38] (03CR) 10Marostegui: [C: 032] s2.hosts: Remove db1024 [software] - 10https://gerrit.wikimedia.org/r/352088 (https://phabricator.wikimedia.org/T162699) (owner: 10Marostegui) [06:32:32] (03Merged) 10jenkins-bot: s2.hosts: Remove db1024 [software] - 10https://gerrit.wikimedia.org/r/352088 (https://phabricator.wikimedia.org/T162699) (owner: 10Marostegui) [06:39:00] (03PS3) 10Muehlenhoff: Strip nfs-common/rpcbind during jessie base installation [puppet] - 10https://gerrit.wikimedia.org/r/352105 (https://phabricator.wikimedia.org/T106477) [06:41:11] (03CR) 10Muehlenhoff: [C: 032] Strip nfs-common/rpcbind during jessie base installation [puppet] - 10https://gerrit.wikimedia.org/r/352105 (https://phabricator.wikimedia.org/T106477) (owner: 10Muehlenhoff) [06:42:32] RECOVERY - Check systemd state on cp4013 is OK: OK - running: The system is fully operational [06:44:54] 06Operations, 10ops-eqiad, 10DBA: Decommission db1024 - https://phabricator.wikimedia.org/T164702#3242621 (10Marostegui) [06:45:00] 06Operations, 10DBA, 
13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3242638 (10Marostegui) [06:45:01] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 2 minutes ago with 9 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Mount[/srv/sda3],Mount[/srv/sdb3],Mount[/var/lib/nginx] [06:45:02] 06Operations, 10ops-eqiad, 10DBA: Decommission db1024 - https://phabricator.wikimedia.org/T164702#3242637 (10Marostegui) [06:46:41] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:47:01] RECOVERY - Check systemd state on cp4007 is OK: OK - running: The system is fully operational [06:47:18] (03CR) 10Marostegui: "I don't really have much to say as I don't really know phabricator data model :|" [puppet] - 10https://gerrit.wikimedia.org/r/352125 (https://phabricator.wikimedia.org/T164297) (owner: 10Aklapper) [06:49:05] (03CR) 10Marostegui: [C: 031] db: Comment db1015 being defective [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352188 (owner: 10Jcrespo) [06:53:42] (03CR) 10Marostegui: [C: 04-1] "The IPs are wrong" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/348565 (owner: 10Dzahn) [07:01:23] I need one more hour to clean up so it hits around 10M rows [07:01:36] Amir1: sure [07:01:42] Thanks! 
[07:01:50] 7M done by now [07:10:32] 06Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3242675 (10MoritzMuehlenhoff) [07:13:01] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:31:34] (03PS1) 10Muehlenhoff: Fix proxy configuration for docker image build [puppet] - 10https://gerrit.wikimedia.org/r/352545 [07:32:34] (03CR) 10jerkins-bot: [V: 04-1] Fix proxy configuration for docker image build [puppet] - 10https://gerrit.wikimedia.org/r/352545 (owner: 10Muehlenhoff) [07:33:30] (03CR) 10ArielGlenn: "You need to know which finger is right before the salt minion is actually installed on the client; there's nothing in puppet that can do t" [puppet] - 10https://gerrit.wikimedia.org/r/351914 (owner: 10Andrew Bogott) [07:38:17] (03PS2) 10Muehlenhoff: Fix proxy configuration for docker image build [puppet] - 10https://gerrit.wikimedia.org/r/352545 [07:51:34] 06Operations, 10hardware-requests: Unmanaged switch for eqiad frack - https://phabricator.wikimedia.org/T164561#3242719 (10ayounsi) >>! In T164561#3238920, @RobH wrote: > Why is a standard msw needed if no non frack hosts are in frack? (Is it ONLY used for the pdu uplink?) Seems silly to have a msw sitting s... [08:01:47] (03CR) 10Gilles: [C: 031] webperf: Decom asset-check service [puppet] - 10https://gerrit.wikimedia.org/r/352302 (https://phabricator.wikimedia.org/T164419) (owner: 10Krinkle) [08:06:47] !log clean up party of ores_classification is done now (T159753) 10M rows deleted. 
Current number of rows: 76,586,043 [08:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:56] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [08:15:33] (03PS1) 10Jdlrobson: Add Bengali logo to mobile site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352547 (https://phabricator.wikimedia.org/T164652) [08:25:11] !log swift eqiad-prod: ms-be1028/ms-be1039 container/account full weight - T160640 [08:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:21] T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640 [08:39:25] (03PS4) 10Giuseppe Lavagetto: graphite::alerts: add alerting on session loss [puppet] - 10https://gerrit.wikimedia.org/r/350555 [08:39:27] (03PS1) 10Giuseppe Lavagetto: uwsgi::app: add reload capability in systemd [puppet] - 10https://gerrit.wikimedia.org/r/352551 [08:39:29] (03PS1) 10Giuseppe Lavagetto: service::uwsgi: fix logrotate rules [puppet] - 10https://gerrit.wikimedia.org/r/352552 [08:39:40] <_joe_> uh well I should've rebased [08:39:56] <_joe_> but anyways, the two new patches are for elukey and akosiaris [08:41:25] (03CR) 10Giuseppe Lavagetto: graphite::alerts: add alerting on session loss (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350555 (owner: 10Giuseppe Lavagetto) [08:47:53] !log reboot kafka1013 for kernel upgrades [08:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:25] jynus: marostegui: Hey, I'm deploying today for https://phabricator.wikimedia.org/T151681 (in two hours) I want to know if there is any particular monitors for number of connections to the master so I can check if the situation is improving or worsening? 
[08:51:31] Amir1, loop us in and we can check [08:51:48] on grafana there are several things we can look at [08:52:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:52:44] db1063 statistics: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1063 [08:52:49] mostly processlist [08:52:51] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:52:57] connections [08:53:08] connection problems [08:53:16] ah kafka1012 is due to the kafka1013 reboot, fixing it [08:53:41] and I can have a look at performance_schema statistics (which are not yet exposed outside of the database) [08:54:05] also innodb purge lag- I expect an impact there [08:54:21] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:54:22] Thanks [08:54:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:54:51] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [08:54:51] PROBLEM - Check systemd state on kafka1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:54:56] more important is that nothing breaks on wikidata because of it [08:55:13] I can check the wbterms table writes?
[08:55:13] Looking at mysql dashboard, there are lots of interesting things there, I didn't know there is support for MyISAM [08:55:28] there is, but you should not even think about it [08:55:37] MyISAM == devil [08:55:51] RECOVERY - Check systemd state on kafka1014 is OK: OK - running: The system is fully operational [08:55:53] :D [08:55:59] !log restart Kafka mirror maker on kafka101[24] [08:56:02] jynus: no, we should check only wb_changes_stats tables [08:56:02] we only have that graph because it is used on labs [08:56:05] Amir1, thanks [08:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:21] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:56:21] sorry, the correct table: "wb_changes_dispatch" [08:56:35] I see [08:56:41] Amir1, are there specific stats on the new system? [08:56:50] not a blocker [08:56:57] but it would be a nice thing to have [08:57:09] like "how many processes are waiting/executed", etc [08:57:10] not yet. 
But let me check if redis can support that [08:57:22] almost surely yes [08:57:29] this is not for us [08:57:31] I think graphite should have some sort of stats [08:57:36] this is for you [08:57:56] in case of problems, debugging stats will be most certainly needed [08:58:05] look at all the issues with the job queue [08:59:52] https://grafana.wikimedia.org/dashboard/db/redis?refresh=5m&orgId=1 [09:00:00] This is not super helpful [09:01:23] That's better [09:01:24] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=rdb1001&var-network=eth0 [09:04:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [09:04:51] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:04:54] checking... [09:05:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [09:05:51] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [09:09:02] by the way, we are dropping "entity_per_page" table soon [09:17:00] !log swift codfw-prod: more ms-be2001/ms-be2012 decom - T162785 [09:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:08] T162785: Decomission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785 [09:25:46] !log rolling restart of cassandra on aqs* hosts to pick up new jvm upgrades [09:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:24] !log swift eqiad-prod: ms-be1028/ms-be1039 object weight 2000 - T160640 [09:30:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:33] T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640 [09:31:51] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [09:34:51] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:36:21] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [09:37:20] (03PS6) 10Filippo Giunchedi: Send 5xx from kafkatee to logstash [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) [09:37:22] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:39:39] (03CR) 10Filippo Giunchedi: [C: 032] Send 5xx from kafkatee to logstash [puppet] - 10https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [09:42:49] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3243037 (10ayounsi) Switch port configured. SRXs in a cluster configuration have a different way of numbering ports. What might looks like 1/0/8 is actually 2/0/8 (still on pfw1) ``` ayounsi@p... [09:44:11] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:01] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.073 second response time [09:50:21] PROBLEM - HP RAID on ms-be1030 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[09:54:08] !log upgrading mw1261-mw1264 to Linux 4.9 [09:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:41] PROBLEM - HP RAID on ms-be1028 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [09:58:21] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:02:41] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:03:11] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:03:48] godog: I guess those are due because of the new weight hence they are more overloaded? ^^^ [10:04:01] PROBLEM - swift eqiad-prod object availability on graphite1001 is CRITICAL: CRITICAL: 6.25% of data under the critical threshold [90.0] [10:05:51] PROBLEM - HP RAID on ms-be1031 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [10:07:18] volans: yeah, I'll silence those [10:08:51] ok [10:25:30] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10MoritzMuehlenhoff) [10:25:58] ACKNOWLEDGEMENT - Host mw1264 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T164725 [10:33:11] RECOVERY - HP RAID on ms-be1037 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [10:33:36] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3243214 (10MoritzMuehlenhoff) a:03Papaul [10:45:24] (03PS1) 10Addshore: Put Cognate in write mode for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352569 (https://phabricator.wikimedia.org/T164407) [10:45:40] (03CR) 10Addshore: [C: 04-1] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352569 (https://phabricator.wikimedia.org/T164407) (owner: 10Addshore) [10:50:21] RECOVERY - HP RAID on ms-be1030 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 
1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [10:59:30] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3243249 (10fgiunchedi) Basic functionality kafkatee -> logstash has been added, left to do: [] `jq` ignores `EPIPE` when writing, thus i... [11:00:01] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [11:00:04] Amir1: Dear anthropoid, the time has come. Please deploy Using redis in dispatching changes in Wikibase (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1100). [11:00:50] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3243253 (10elukey) Started a dashboard in https://logstash.wikimedia.org/app/kibana#/visualize/edit/Webrequest:-all-50X-events [11:03:39] Sorry for being late. I will deploy in one sec [11:05:15] (03CR) 10Ladsgroup: [C: 032] Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: 10Ladsgroup) [11:05:23] (03CR) 10jerkins-bot: [V: 04-1] Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: 10Ladsgroup) [11:05:34] _joe_ marostegui jynus: I'm deploying the redis dispatching changes [11:05:51] RECOVERY - HP RAID on ms-be1031 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:06:24] Amir1: ok! 
[11:07:56] (03PS3) 10Ladsgroup: Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) [11:08:29] (03CR) 10Ladsgroup: "PS3 is manual rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: 10Ladsgroup) [11:10:59] (03CR) 10Ladsgroup: [C: 032] Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: 10Ladsgroup) [11:12:19] (03Merged) 10jenkins-bot: Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: 10Ladsgroup) [11:12:31] (03CR) 10jenkins-bot: Use redis lock manager for dispatching in all production repo instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347395 (https://phabricator.wikimedia.org/T159826) (owner: 10Ladsgroup) [11:14:32] !log start of ladsgroup@tin:/srv/mediawiki-staging$ scap sync-file wmf-config/Wikibase-production.php 'USe redis lockManager for change dispatching (T159826)' [11:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:42] T159826: Use redis-based lock manager in dispatch changes in production - https://phabricator.wikimedia.org/T159826 [11:15:21] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: USe redis lockManager for change dispatching (T159826) (duration: 00m 56s) [11:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:22] I might be super optimist but number of processes in s5 master is already decreased badly: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&panelId=37&fullscreen [11:20:01] RECOVERY - HP RAID on 
ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:40:35] Terbium fingerprint is not in https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints [11:40:50] It gives MIM error to me when I'm trying to connect [11:41:01] (03PS1) 10Reedy: Re-instate "Run Pdf Processors in firejails" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352572 (https://phabricator.wikimedia.org/T164145) [11:41:08] (03PS2) 10Reedy: Re-instate "Run Pdf Processors in firejails" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352572 (https://phabricator.wikimedia.org/T164145) [11:42:02] (03PS2) 10Alexandros Kosiaris: Assign IPs for ganeti2007, ganeti2008 [dns] - 10https://gerrit.wikimedia.org/r/351303 (https://phabricator.wikimedia.org/T164011) [11:42:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Assign IPs for ganeti2007, ganeti2008 [dns] - 10https://gerrit.wikimedia.org/r/351303 (https://phabricator.wikimedia.org/T164011) (owner: 10Alexandros Kosiaris) [11:44:21] Amir1: it was reimaged few days ago (see SAL), let me add those too [11:44:29] Thanks [11:44:54] (03PS1) 10Alexandros Kosiaris: Introduce ganeti2007, ganeti2008 [puppet] - 10https://gerrit.wikimedia.org/r/352573 [11:45:46] Amir1: done [11:46:23] Thanks [11:46:28] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce ganeti2007, ganeti2008 [puppet] - 10https://gerrit.wikimedia.org/r/352573 (owner: 10Alexandros Kosiaris) [11:46:32] (03PS2) 10Alexandros Kosiaris: Introduce ganeti2007, ganeti2008 [puppet] - 10https://gerrit.wikimedia.org/r/352573 [11:46:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Introduce ganeti2007, ganeti2008 [puppet] - 10https://gerrit.wikimedia.org/r/352573 (owner: 10Alexandros Kosiaris) [11:47:09] yup, they are the same [11:48:30] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - 
https://phabricator.wikimedia.org/T164011#3243410 (akosiaris) [11:48:32] Operations, ops-codfw, netops: codfw: ganeti2007-ganeti2008 switch port configuration - https://phabricator.wikimedia.org/T164594#3243407 (akosiaris) Open>Resolved a:akosiaris Done. ports added to interface-range ganeti which sets trunk, vlan. Added descriptions as well, resolving. [11:49:58] (PS1) Ladsgroup: Do not rebuild or make dumps of wb_entity_per_page [puppet] - https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) [11:55:21] (CR) ArielGlenn: "This looks fine to me but it shouldn't be deployed until the current run completes (don't change job config in the middle of a run)." [puppet] - https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) (owner: Ladsgroup) [12:01:11] _joe_: I'm monitoring https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=rdb1001&var-network=eth0&from=now-30m&to=now and nothing suspicious came up (still one hour after the deploy) but can you check redis logs to see if wikibase is using redis as lock manager or not? [12:01:23] I wanted to check but logstash wasn't helpful [12:01:44] (I would love to do "redis-cli monitor" somewhere) [12:02:47] <_joe_> Amir1: the lockmanager is on rdb1001? 
[12:02:51] <_joe_> I don't think so [12:03:01] <_joe_> I think it's on some mc* hosts actually [12:03:06] <_joe_> check mediawiki-config [12:03:34] last time I checked it was on rdb1001 [12:03:37] but let me check again [12:04:51] it says rdb1 and it translates to 'rdb1' => '10.64.0.180' [12:05:26] (PS2) Alexandros Kosiaris: Create kubemaster.svc.$site.wmnet [dns] - https://gerrit.wikimedia.org/r/351836 (https://phabricator.wikimedia.org/T162040) [12:05:42] <_joe_> mc1001.eqiad.wmnet [12:05:44] <_joe_> :) [12:05:51] <_joe_> rdb1 is just a label [12:06:02] <_joe_> we might want to add some comments there [12:06:02] shard01 [12:06:10] Yeah :D [12:06:30] hopefully we are not load-balancing the exclusive locks? [12:07:18] Amir1, I can check how many GET_LOCKs we are running lately [12:07:19] jynus: since it's being used for filebackend I doubt that [12:07:35] jynus: it would be great [12:07:37] I know, but do not know enough [12:07:47] to check it isn't :-) [12:07:51] <_joe_> I think we don't do locking that way :P [12:08:04] <_joe_> anyways, I have to work on something els [12:08:07] <_joe_> *else [12:08:20] GET_LOCK on the database master [12:08:34] which was indeed the old way of doing locks for wikidata [12:08:42] and was what I asked to change :-) [12:08:56] (although they knew already it was suboptimal) [12:09:45] Amir1, changedispatcher is done by a job? [12:09:46] <_joe_> sorry, am I getting it right? [12:10:04] <_joe_> we deployed a change that moved locks for wikidata to redis? [12:10:06] jynus: nope, cronjob [12:10:10] https://github.com/wikimedia/puppet/blob/c74da95671f7f2c0f77de500875ea5e25beca73c/modules/mediawiki/manifests/maintenance/wikidata.pp [12:10:11] <_joe_> who did that? 
[12:10:28] I assume Amir1 did [12:10:38] _joe_: I did, we talked about it before the switchover [12:10:41] with you [12:10:55] <_joe_> yeah I didn't understand it was for the whole wikidata [12:10:58] and I assume you were in the loop- I wasn't [12:11:06] <_joe_> not really :) [12:11:29] <_joe_> also, I was under the impression this was a new feature [12:11:43] I wasn't because they were migrating away from databases, so I didn't follow it [12:12:05] <_joe_> Amir1: it really says 'rdb1' => '10.64.0.180' ? [12:12:19] <_joe_> if that's the case and the code is updated, that's very wrong [12:12:35] amir@amir-GL552VW:~/mediawiki-config$ grep -ir "rdb1" . [12:12:35] ./wmf-config/ProductionServices.php: 'rdb1' => '10.64.0.180', [12:13:22] <_joe_> elukey: didn't you fix those? [12:13:23] <_joe_> FFS [12:13:39] Amir1, I would suggest to revert for now [12:13:50] <_joe_> no, just the first one is wrong [12:13:55] <_joe_> so it should still work [12:14:09] <_joe_> I'm fixing it [12:14:57] _joe_: one thing, this is for dispatching and it's just ten connections (at the most) to redis. [12:15:00] <_joe_> sorry, it's correct, it's .80 now [12:15:04] <_joe_> Amir1: you tricked me [12:15:12] <_joe_> 'rdb1' => '10.64.0.80', [12:15:25] oh, my local repo wasn't updated [12:15:32] sorry about that [12:16:06] <_joe_> so, everything is set up correctly [12:16:28] phew [12:16:34] _joe_, this is indeed a new feature [12:16:52] <_joe_> jynus: oh ok so it's not something that worked on the db forever and that we migrated [12:16:59] well, yes [12:17:06] but the mediawiki code is new [12:17:10] is not just a config change [12:17:18] from X backend to U backend [12:17:20] just to note, we extensively tested it in several places (testwikis, two instances in labs, localhost) and by three persons [12:17:37] as far as I know, Amir1 implemented the redis code, right? [12:17:43] jynus: yup [12:17:43] from 0? 
[12:17:59] so not a new feature as in new functionality [12:18:01] jynus: We are using mediawiki LockManager [12:18:04] but new backend [12:18:28] "worked on the db" is relative [12:18:43] and I offered to continue using a db, just not the master [12:18:51] but they didn't want to [12:19:20] <_joe_> why? [12:19:44] why they didn't want them or why I offered that, or why I wanted them to change? [12:19:47] <_joe_> anyways, redises are fine [12:20:06] <_joe_> from my perspective, it's all right, let's see on the app side later [12:20:19] <_joe_> I gtg now, I will be back a bit later [12:20:24] ok, bye [12:21:19] o/ [12:23:07] Amir1, is there a way to check the mediawiki functionality working as intended? [12:23:43] jynus: depends on how well [12:23:55] but the most important one is this [12:24:05] https://www.wikidata.org/wiki/Special:DispatchStats [12:24:20] when the lag is zero minutes, it means it's dispatching correctly [12:25:16] but it might be dispatching several times (in that case we should see lots of duplicate rows in wb_changes tables in clients) [12:25:17] but technically, dispatching could happen if a lock failed, right? [12:25:23] yeah [12:25:32] but should we? [12:25:46] shouldn't it just have write contention? [12:25:50] if a lock fails it rolls back the whole transaction [12:26:07] and moves on to another client [12:28:43] and if a lock fails and says it's free (even though it's not) we will end up with duplicate rows in db (we can check and if we see them, it means the lock manager failed) [12:29:04] do we? [12:29:16] not (yet) [12:29:17] isn't there a unique key or something limiting that? [12:29:51] do you have the main write the change dispatcher does? 
[12:30:02] or just a link to the code [12:30:32] sure [12:30:32] https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/c9fdba2ff1936461b5d6a7544aa2d68b2cdf1ac1/repo/includes/Store/Sql/LockManagerSqlChangeDispatchCoordinator.php [12:30:47] well, it's https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/9c458d3c94ab8a3818f2f9537a56e0d8c68019ec/repo/maintenance/dispatchChanges.php [12:31:16] and it runs this class and that runs this class: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/c9fdba2ff1936461b5d6a7544aa2d68b2cdf1ac1/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php [12:31:29] These three are the most important ones [12:32:14] (Abandoned) Fdans: Add AutomatedRequest to schema black list [puppet] - https://gerrit.wikimedia.org/r/350235 (https://phabricator.wikimedia.org/T67508) (owner: Fdans) [12:32:29] apparently id does SELECT for update [12:32:31] *it [12:32:51] so it should not create duplicates, only contention [12:32:56] if it fails [12:33:14] do you mean deadlocks? 
[12:33:39] maybe, but not necessarily [12:33:50] normally innodb_lock_wait [12:34:14] I think we can monitor that [12:34:27] it is on https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-12h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&panelId=19&fullscreen [12:35:06] and https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-12h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&panelId=24&fullscreen [12:37:30] both look normal [12:37:50] (Note: I deployed around 11 UTC) [12:37:51] joe may be concerned about "moving things to redis" - I think it shares code or similarities with the image workflow, so we are not limited by that [12:38:02] Amir1, correct me if I am wrong [12:38:22] jouncebot: next [12:38:22] In 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300) [12:38:37] and if a redis alternative existed that was better, a migration could be done, right? 
[12:38:39] yeah, we are completely using their infrastructure [12:38:56] without doubling the efforts of migrating images [12:39:03] jynus: yes, the lock manager is so flexible we can even move it to memcached [12:39:13] I am not saying we have to [12:39:23] I know :) [12:39:31] but joe most likely is concerned with redis and the job queue model [12:39:45] but you are using standardized apis [12:40:40] Yes [12:40:43] Exactly [12:41:15] I would have looped anyone on ops more [12:41:26] before deploy- I was in a meeting [12:42:26] and I didn't even know this was happening today [12:43:49] and you probably didn't know there were issues with the redis locking during the failover [12:43:56] Point taken, I explicitly contacted joe and some other people, the response I got was "It's okay but do it after the switchover" [12:44:03] he he [12:44:05] yeah [12:44:22] we sometimes get too much things going on [12:44:25] *many [12:45:34] (PS1) Alexandros Kosiaris: lvs: Add the kubernetes master service/cluster [puppet] - https://gerrit.wikimedia.org/r/352580 (https://phabricator.wikimedia.org/T162040) [12:45:36] (PS1) Alexandros Kosiaris: Migrate to using kubemaster.svc.$site.wmnet [puppet] - https://gerrit.wikimedia.org/r/352581 [12:45:54] (PS1) Fdans: Add bot filter to mysql consumer [puppet] - https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [12:46:34] yeah, I know about a jobrunner problem (it caused watchlist entries to get multiplied) but I wasn't aware of a redis lock manager problem in the switchover [12:46:57] (CR) jerkins-bot: [V: -1] Add bot filter to mysql consumer [puppet] - https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: Fdans) [12:47:13] (PS3) Gehel: elasticsearch - update reprepro configuration [puppet] - https://gerrit.wikimedia.org/r/351629 [12:49:34] (CR) Gehel: [C: 2] elasticsearch - update reprepro configuration [puppet] - 
https://gerrit.wikimedia.org/r/351629 (owner: Gehel) [12:50:08] it was commented briefly here: https://lists.wikimedia.org/pipermail/wikitech-l/2017-April/087999.html [12:50:26] you are not supposed to know it, that is why I said to involve ops [12:51:51] Thanks [12:52:17] jynus: Do you know where I can find shard01 metrics? It wasn't in server board in grafana [12:52:19] (PS2) Fdans: Add bot filter to mysql consumer [puppet] - https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [12:55:57] <_joe_> Amir1: https://grafana.wikimedia.org/dashboard/db/redis?refresh=5m&orgId=1&var-cluster=mc1*&var-server=All [12:56:37] Thanks [12:58:22] Operations, Performance, User-Elukey: Investigate a simplified Redis Job Queues replication model - https://phabricator.wikimedia.org/T164738#3243584 (elukey) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300). Please do the needful. [13:00:05] Urbanecm, kart_, Jdlrobson, and Gilles: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:01:33] o/ [13:01:37] here [13:01:45] \o [13:01:53] Operations, Performance, User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243614 (elukey) [13:02:28] * kart_ here [13:02:41] anybody wants to do swat? or should I do it? :) [13:05:02] Urbanecm: around for swat? [13:05:30] looks like nobody is taking swat, so... [13:05:33] I can swat today! [13:05:38] (PS2) Andrew Bogott: Labs/salt: User sha256 salt finger with newer minions. 
[puppet] - https://gerrit.wikimedia.org/r/351914 [13:05:39] zeljkof: thanks :) [13:06:18] (PS3) Andrew Bogott: Labs/salt: Use sha256 salt finger with newer minions. [puppet] - https://gerrit.wikimedia.org/r/351914 [13:06:19] jdlrobson: since Urbanecm does not respond, I will deploy your changes first [13:06:37] * zeljkof is reviewing 351664 [13:07:16] jdlrobson: if I deploy your commits to mwdebug1002, can you test them there? [13:07:17] (CR) jerkins-bot: [V: -1] Labs/salt: Use sha256 salt finger with newer minions. [puppet] - https://gerrit.wikimedia.org/r/351914 (owner: Andrew Bogott) [13:07:19] !log restarting cassandra on restbase2001 to pick up openjdk security updates [13:07:20] sure [13:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:28] zeljkof: yup i can [13:07:40] jdlrobson: ok, will ping you as soon as each commit is there [13:08:38] (PS3) Zfilipin: Wikivoyage should show related pages in footer of skin [mediawiki-config] - https://gerrit.wikimedia.org/r/351664 (https://phabricator.wikimedia.org/T164391) (owner: Jdlrobson) [13:08:42] (PS4) Andrew Bogott: Labs/salt: Use sha256 salt finger with newer minions. [puppet] - https://gerrit.wikimedia.org/r/351914 [13:09:41] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [13:10:01] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:10:11] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:10:30] (CR) jerkins-bot: [V: -1] Labs/salt: Use sha256 salt finger with newer minions. 
[puppet] - https://gerrit.wikimedia.org/r/351914 (owner: Andrew Bogott) [13:10:41] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.000 second response time on 10.192.16.162 port 9042 [13:11:02] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/351664 (https://phabricator.wikimedia.org/T164391) (owner: Jdlrobson) [13:11:11] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 127 days) [13:11:51] PROBLEM - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:11:52] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused [13:12:13] (PS5) Andrew Bogott: Labs/salt: Use sha256 salt finger with newer minions. [puppet] - https://gerrit.wikimedia.org/r/351914 [13:12:51] RECOVERY - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-b valid until 2017-09-12 15:13:28 +0000 (expires in 127 days) [13:12:51] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.000 second response time on 10.192.16.163 port 9042 [13:13:25] (Merged) jenkins-bot: Wikivoyage should show related pages in footer of skin [mediawiki-config] - https://gerrit.wikimedia.org/r/351664 (https://phabricator.wikimedia.org/T164391) (owner: Jdlrobson) [13:13:39] (CR) jenkins-bot: Wikivoyage should show related pages in footer of skin [mediawiki-config] - https://gerrit.wikimedia.org/r/351664 (https://phabricator.wikimedia.org/T164391) (owner: Jdlrobson) [13:13:41] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:14:23] (PS4) Zfilipin: pagePreviews: Fix NavPopups gadget detection 
[mediawiki-config] - https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: Jdlrobson) [13:14:41] RECOVERY - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-c valid until 2017-09-12 15:13:30 +0000 (expires in 127 days) [13:15:37] Operations, Labs: Stretch vs. Salt - https://phabricator.wikimedia.org/T164595#3239263 (faidon) Why are you trying to install the jessie version to stretch boxes? In the test box that we've had in prod, the stretch version of salt (as shipped by Debian) works fine with our jessie master. There were some... [13:16:07] Operations, Performance, User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243584 (GWicke) Note that implementation work for {T157088} has started, which is designed to fully replace the Redis backend, and avoid the associate... [13:17:33] Operations, Traffic: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3239748 (faidon) Undeploying cache_misc sounds unfortunate… Why not keep the existing 4-year old ulsfo hardware for cache-misc still, perhaps keeping some of the other old servers in there for parts? [13:18:33] jdlrobson: 351664 is at mwdebug1002, please test and let me know if I can deploy it [13:18:46] related pages one? [13:18:49] okay looking into it [13:19:09] jdlrobson: yes, "Wikivoyage should show related pages in footer of skin" [13:19:16] I'm online but reading, ping me if redis/wikidata/dispatching or anything like that went crazy [13:20:12] zeljkof: it's not quite working as expected.. let me check something [13:20:37] jdlrobson: ok [13:21:00] zeljkof: it looks like i might need a minor follow up.. that okay? [13:21:09] jdlrobson: sure [13:21:21] so, this one should not be deployed yet? [13:21:28] nope not yet [13:21:30] should I wait until you push the follow up? 
[13:21:31] i made a mistake in my patch :/ [13:21:40] should it be reverted? [13:22:13] not necessarily [13:22:17] (PS1) Jdlrobson: Use correct key for Wikivoyage related pages enablement [mediawiki-config] - https://gerrit.wikimedia.org/r/352589 [13:22:24] not sure what's easier.. just merging the above [13:22:38] or you reverting that and me creating a new patch with both changes [13:22:56] jdlrobson: looking at 352589... [13:25:00] jdlrobson: 352589 looks good to me, should I merge it and let you know when it is at mwdebug1002? [13:25:01] (PS1) Gehel: elasticsearch - upgrade reprepro to elasticsearch 5.3.2 [puppet] - https://gerrit.wikimedia.org/r/352590 (https://phabricator.wikimedia.org/T163705) [13:25:36] thanks zeljkof [13:26:44] jdlrobson: ok, merging... [13:27:01] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/352589 (owner: Jdlrobson) [13:30:01] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:30:20] (Merged) jenkins-bot: Use correct key for Wikivoyage related pages enablement [mediawiki-config] - https://gerrit.wikimedia.org/r/352589 (owner: Jdlrobson) [13:30:33] (CR) jenkins-bot: Use correct key for Wikivoyage related pages enablement [mediawiki-config] - https://gerrit.wikimedia.org/r/352589 (owner: Jdlrobson) [13:32:40] jdlrobson: 352589 is at mwdebug1002, please test [13:32:42] Operations, Performance, User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243783 (elukey) @GWicke wonderful news, thanks! We are ~5/6 months away from the new solution though and I am still convinced that it might be good to... [13:32:48] on it! 
[13:33:32] (PS5) Zfilipin: pagePreviews: Fix NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: Jdlrobson) [13:33:35] zeljkof: looks great! [13:33:39] you can merge the wikivoyage one [13:33:49] jdlrobson: ok, deploying [13:36:20] uh oh, scap errors [13:36:20] will paste [13:36:40] uh [13:36:42] (PS1) Giuseppe Lavagetto: Update debian files [software/service-checker] - https://gerrit.wikimedia.org/r/352594 [13:36:44] (PS1) Giuseppe Lavagetto: Fix data structure for jobs [software/service-checker] - https://gerrit.wikimedia.org/r/352595 [13:36:44] 500 Notice: Undefined variable: wmgRelatedArticlesUseCirrusSearch in /srv/mediawiki/wmf-config/CommonSettings.php on line 2899 [13:37:02] jdlrobson: ^ [13:37:18] uh, I should have deployed the other file first, will try [13:38:22] zeljkof: let me update it [13:38:50] I have deployed common- before initialise-settings [13:39:07] but deploying initialise-settings also returns scap error :( [13:39:22] 13:37:56 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/InitialiseSettings.php', 'tin.eqiad.wmnet'] on mw1264.eqiad.wmnet returned [255]: ssh: connect to host mw1264.eqiad.wmnet port 22: No route to host [13:39:33] zeljkof: feel free to skip this one. It's not urgent [13:39:44] i'll try it in a different swat window and i don't want to hold up other deploys [13:39:52] 2017-05-08 09:54 upgrading mw1261-mw1264 to Linux 4.9 [13:40:07] (although i'm not sure what's going on here) [13:40:14] moritzm: I am having trouble with mw1264, can you help? 
[13:40:21] zeljkof: that host is down, needs dc-ops investigation [13:40:25] looks like you have been upgrading it today [13:40:35] https://phabricator.wikimedia.org/T164725 [13:40:43] <_joe_> moritzm: so set it to pooled: "inactive" [13:40:50] <_joe_> that will remove it from the dsh group [13:40:53] <_joe_> I'll do it [13:40:59] yep, forgot about that, will do it now [13:41:06] <_joe_> ok [13:41:13] <_joe_> I'll hold off [13:41:19] _joe_, moritzm: thanks, trying to figure out how to continue with scap [13:41:23] I mean swat [13:41:46] <_joe_> zeljkof: well you should be able to continue [13:41:51] let me know when I can retry scap sync-file [13:41:57] _joe_: thanks! [13:42:43] !log depooled mw1264 (set to inactive), since the host is down (T164725) [13:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:54] T164725: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725 [13:44:04] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:351664|Wikivoyage should show related pages in footer of skin (T164391)]] (duration: 00m 39s) [13:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:13] T164391: Switch Related Articles on Wikivoyage from sidebar to footer - https://phabricator.wikimedia.org/T164391 [13:45:15] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:351664|Wikivoyage should show related pages in footer of skin (T164391)]] (duration: 00m 39s) [13:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:10] _joe_: thanks, everything works fine now [13:46:20] <_joe_> cool [13:46:28] yay [13:46:36] jdlrobson: 351664 and 352589 are deployed, please test on production [13:46:40] on it! 
[13:47:00] (CR) Giuseppe Lavagetto: [C: 2] Update debian files [software/service-checker] - https://gerrit.wikimedia.org/r/352594 (owner: Giuseppe Lavagetto) [13:47:12] I pushed the files in the wrong order and caused some trouble, should be fixed now (Notice: Undefined variable: wmgRelatedArticlesShowInSidebar in /srv/mediawiki/wmf-config/CommonSettings.php on line 2893) [13:47:13] !log labservices1002 'touch /forcefsck && sudo reboot' [13:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:32] Operations, Labs: Stretch vs. Salt - https://phabricator.wikimedia.org/T164595#3243832 (Andrew) Open>Invalid There was some discussion on IRC about this, during which we were briefly in agreement that we should be running the same minion on all hosts. That's clearly not going to happen so I've m... [13:47:41] Operations, Labs: Stretch vs. Salt - https://phabricator.wikimedia.org/T164595#3243834 (akosiaris) >>! In T164595#3243687, @faidon wrote: > Why are you trying to install the jessie version to stretch boxes? In the test box that we've had in prod, the stretch version of salt (as shipped by Debian) works f... [13:48:06] jdlrobson: reviewing 351166... 
[13:48:29] zeljkof: all is good in production [13:48:48] (Merged) jenkins-bot: Update debian files [software/service-checker] - https://gerrit.wikimedia.org/r/352594 (owner: Giuseppe Lavagetto) [13:48:51] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:48:51] jdlrobson: great :) some trouble on the way, but should be fine now [13:49:55] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: Jdlrobson) [13:51:35] (Merged) jenkins-bot: pagePreviews: Fix NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: Jdlrobson) [13:51:47] (CR) jenkins-bot: pagePreviews: Fix NavPopups gadget detection [mediawiki-config] - https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: Jdlrobson) [13:51:51] (CR) Andrew Bogott: "I tested this on Labs and it works fine. Do we want to use that fact on prod as well rather than switching based on master?" [puppet] - https://gerrit.wikimedia.org/r/351914 (owner: Andrew Bogott) [13:52:11] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 37.19 ms [13:53:44] jdlrobson: 351166 is at mwdebug1002, please test [13:54:03] zeljkof: that's good to go! 
:) [13:54:12] jdlrobson: ok, deploying [13:55:20] (PS2) Giuseppe Lavagetto: Fix data structure for jobs [software/service-checker] - https://gerrit.wikimedia.org/r/352595 [13:55:28] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:351166|pagePreviews: Fix NavPopups gadget detection (T164044)]] (duration: 00m 39s) [13:55:35] jdlrobson: 351166 is at production, please test [13:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:38] T164044: Navigational popups and page previews appearing simultaneously on simplewiki - https://phabricator.wikimedia.org/T164044 [13:55:57] (CR) Volans: [C: 1] "LGTM" [software/service-checker] - https://gerrit.wikimedia.org/r/352595 (owner: Giuseppe Lavagetto) [13:56:39] (PS2) Zfilipin: Add Bengali logo to mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/352547 (https://phabricator.wikimedia.org/T164652) (owner: Jdlrobson) [13:58:47] (CR) Giuseppe Lavagetto: [C: 2] Fix data structure for jobs [software/service-checker] - https://gerrit.wikimedia.org/r/352595 (owner: Giuseppe Lavagetto) [13:59:07] production is good [13:59:10] ^ zeljkof [13:59:27] jdlrobson: great, working on 352547... [14:00:07] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/352547 (https://phabricator.wikimedia.org/T164652) (owner: Jdlrobson) [14:01:13] Operations, Traffic: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3243905 (BBlack) We could do so as a goal at the end of the process, depending how we arrange things. @RobH says we're short on power there to plug in all the new systems while the old ones are running. So t... 
[14:01:14] (Merged) jenkins-bot: Add Bengali logo to mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/352547 (https://phabricator.wikimedia.org/T164652) (owner: Jdlrobson) [14:01:26] (CR) Filippo Giunchedi: [C: -1] "LGTM, just two typos" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/350555 (owner: Giuseppe Lavagetto) [14:01:28] (CR) jenkins-bot: Add Bengali logo to mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/352547 (https://phabricator.wikimedia.org/T164652) (owner: Jdlrobson) [14:01:40] (Abandoned) Filippo Giunchedi: mediawiki: alarm on session loss and bad tokens [puppet] - https://gerrit.wikimedia.org/r/256422 (https://phabricator.wikimedia.org/T108985) (owner: Filippo Giunchedi) [14:02:08] !log extending eu swat for a few minutes [14:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:40] jdlrobson: 352547 is at mwdebug1002, please test [14:05:04] perfect zeljkof [14:05:12] that can hit production [14:05:16] sorry all this has taken longer than expected [14:05:21] Urbanecm, kart_, gilles: sorry, the time for eu swat is up, my connection was slow and we had some problems [14:05:52] please reschedule your commits for a swat later today or tomorrow [14:06:07] jdlrobson: no problems, there were problems on my end, and with scap too [14:06:25] zeljkof: Okay, if we can't SWAT now, I can reschedule it. 
[14:06:25] jdlrobson: please don't go yet, I need you to check something in a few minutes [14:06:27] :/ [14:06:32] no problem [14:06:43] jdlrobson: deploying 352547 to production [14:08:21] !log zfilipin@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-bn.svg: SWAT: [[gerrit:352547|Add Bengali logo to mobile site (T164652)]] (duration: 00m 39s) [14:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:29] T164652: Use the correct Bengali Wikipedia wordmark on mobile site - https://phabricator.wikimedia.org/T164652 [14:09:10] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:352547|Add Bengali logo to mobile site (T164652)]] (duration: 00m 39s) [14:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:27] jdlrobson: deployed 352547 to production, please check [14:09:33] zeljkof: rescheduled my patch tomorrow :) [14:09:54] awesome. thanks zeljkof verified [14:10:16] Urbanecm, kart_: apologies one more time, see you tomorrow :) [14:10:29] jdlrobson: could you please check this error, it is not going away from the logs [14:10:38] 501 Notice: Undefined variable: wmgRelatedArticlesShowInSidebar in /srv/mediawiki/wmf-config/CommonSettings.php on line 2893 [14:11:10] looks like $wmgRelatedArticlesShowInSidebar is not defined? [14:11:35] zeljkof: let me take a look [14:12:37] (PS1) Jdlrobson: $wmgRelatedArticlesShowInSidebar is now undefined [mediawiki-config] - https://gerrit.wikimedia.org/r/352596 [14:12:44] https://gerrit.wikimedia.org/r/352596 < zeljkof that will get rid of it [14:12:57] jdlrobson: can you stay around until that is deployed? 
[14:13:01] jouncebot: now [14:13:01] No deployments scheduled for the next 2 hour(s) and 46 minute(s) [14:13:05] zeljkof: sure [14:13:22] Zppix: eu swat is running a bit longer than expected [14:14:53] (PS3) Elukey: Sqoop using the pre-generated orm jar [puppet] - https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: Milimetric) [14:15:06] !log touch /forcefsck && /sbin/reboot labservices1002 [14:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/352596 (owner: Jdlrobson) [14:15:50] (CR) Elukey: [C: 2] "Looks good from https://puppet-compiler.wmflabs.org/6320/analytics1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/351857 (https://phabricator.wikimedia.org/T143119) (owner: Milimetric) [14:16:36] (Merged) jenkins-bot: $wmgRelatedArticlesShowInSidebar is now undefined [mediawiki-config] - https://gerrit.wikimedia.org/r/352596 (owner: Jdlrobson) [14:16:44] (CR) jenkins-bot: $wmgRelatedArticlesShowInSidebar is now undefined [mediawiki-config] - https://gerrit.wikimedia.org/r/352596 (owner: Jdlrobson) [14:17:51] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:18:38] that seems bad [14:18:57] or is it on labs1002? 
[14:19:06] !log Run pt-table-checksum on s7.arwiki - T163190 [14:19:11] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 38.66 ms [14:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:14] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190 [14:19:39] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:352596|$wmgRelatedArticlesShowInSidebar is now undefined]] (duration: 00m 39s) [14:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:14] jdlrobson: 352596 is deployed, the number of error messages in the logs is decreasing [14:20:23] !log eu swat finished! [14:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:19] zeljkof: sweet. doesn't seem to have interfered with anything [14:21:31] jdlrobson: no, just log spam [14:21:38] as far as I can see [14:23:14] <_joe_> !log uploading new version of service-checker to reprepro [14:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:31] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:27:11] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 37.81 ms [14:27:44] ^ me sorry [14:29:46] 06Operations, 10ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (10Jgreen) [14:36:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 439417.50 seconds [14:37:55] came back from downtime [14:38:01] I will silence it as the alter is still running [14:40:02] (03PS1) 10Muehlenhoff: Add elasticsearch-curator for jessie [puppet] - 10https://gerrit.wikimedia.org/r/352598 [14:41:42] (03CR) 10Muehlenhoff: [C: 032] Add elasticsearch-curator for jessie [puppet] - 10https://gerrit.wikimedia.org/r/352598 (owner: 10Muehlenhoff) [14:43:17] 06Operations, 10ops-codfw: 
Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3244147 (10Papaul) p:05Triage>03Normal [14:43:51] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:46:05] Is that planned ^^? [14:47:07] paladox: chasemp mentioned it was him, so I guess he is working on it [14:47:19] Ah, ok. Thanks. [14:49:13] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3244176 (10fgiunchedi) @gilles indeed for "number of thumbnails stored" I think it makes sense to extract a topN. Anyways I'm still thinking throu... [14:50:11] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [15:07:06] (03CR) 10Umherirrender: [C: 04-1] wfLoadExtension( 'ZeroBanner' ) in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348295 (https://phabricator.wikimedia.org/T163041) (owner: 10Reedy) [15:13:27] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3244260 (10Cmjohnson) I really do not know where else they can go...each one of these requires 6u of space and has to be in (prefer... [15:13:30] (03CR) 10Faidon Liambotis: [C: 032] Fix typo in stretch config [puppet] - 10https://gerrit.wikimedia.org/r/351640 (owner: 10Muehlenhoff) [15:17:40] (03PS1) 10BBlack: varnish: re-size (again) frontend memory [puppet] - 10https://gerrit.wikimedia.org/r/352601 [15:19:23] (03CR) 10BBlack: [V: 032 C: 032] varnish: re-size (again) frontend memory [puppet] - 10https://gerrit.wikimedia.org/r/352601 (owner: 10BBlack) [15:22:41] (03CR) 10Faidon Liambotis: [C: 04-1] "See inline." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351914 (owner: 10Andrew Bogott) [15:26:21] RECOVERY - Varnish HTCP daemon on cp4016 is OK: PROCS OK: 1 process with UID = 113 (vhtcpd), args vhtcpd [15:27:37] (03PS2) 10Gehel: ELK - upgrade reprepro to version 5.3.2 of ELK [puppet] - 10https://gerrit.wikimedia.org/r/352590 (https://phabricator.wikimedia.org/T163705) [15:28:32] !log cp4016 repooled [15:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:32] (03CR) 10EBernhardson: [C: 031] ELK - upgrade reprepro to version 5.3.2 of ELK [puppet] - 10https://gerrit.wikimedia.org/r/352590 (https://phabricator.wikimedia.org/T163705) (owner: 10Gehel) [15:29:57] (03PS6) 10Andrew Bogott: Labs/salt: Use sha256 salt fingerprint with newer salt versions [puppet] - 10https://gerrit.wikimedia.org/r/351914 [15:30:08] (03PS3) 10Gehel: ELK - upgrade reprepro to version 5.3.2 of ELK [puppet] - 10https://gerrit.wikimedia.org/r/352590 (https://phabricator.wikimedia.org/T163705) [15:31:01] (03CR) 10jerkins-bot: [V: 04-1] Labs/salt: Use sha256 salt fingerprint with newer salt versions [puppet] - 10https://gerrit.wikimedia.org/r/351914 (owner: 10Andrew Bogott) [15:31:01] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:31:53] (03CR) 10Gehel: [C: 032] ELK - upgrade reprepro to version 5.3.2 of ELK [puppet] - 10https://gerrit.wikimedia.org/r/352590 (https://phabricator.wikimedia.org/T163705) (owner: 10Gehel) [15:34:47] (03PS7) 10Andrew Bogott: Labs/salt: Use sha256 salt fingerprint with newer salt versions [puppet] - 10https://gerrit.wikimedia.org/r/351914 [15:35:22] 06Operations, 10hardware-requests: Unmanaged switch for eqiad frack - https://phabricator.wikimedia.org/T164561#3244345 (10RobH) a:05Cmjohnson>03faidon Ok, after IRC chat, 24 ports is too small (which is one of my concerns). 
The other (two mgmt switches wasting space in rack) are not major issues and thus... [15:39:19] (03CR) 10Andrew Bogott: "ps 7 tested on Labs with both salt-common versions -- works fine." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351914 (owner: 10Andrew Bogott) [15:40:59] !log OS install on ganeti200[7-8] [15:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:24] (03PS1) 10Gehel: elastic - simplify configuration of elastic.co reprepro repositories [puppet] - 10https://gerrit.wikimedia.org/r/352605 (https://phabricator.wikimedia.org/T161908) [15:45:18] (03CR) 10Faidon Liambotis: [C: 031] "LGTM as far as production goes. For Labs, I'm wondering how you are planning to handle this for self-hosted saltmasters and vary the Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/351914 (owner: 10Andrew Bogott) [15:47:11] (03CR) 10Muehlenhoff: [C: 031] elastic - simplify configuration of elastic.co reprepro repositories [puppet] - 10https://gerrit.wikimedia.org/r/352605 (https://phabricator.wikimedia.org/T161908) (owner: 10Gehel) [15:47:33] (03CR) 10Gehel: [C: 032] elastic - simplify configuration of elastic.co reprepro repositories [puppet] - 10https://gerrit.wikimedia.org/r/352605 (https://phabricator.wikimedia.org/T161908) (owner: 10Gehel) [15:53:45] (03PS1) 10Andrew Bogott: Nova policy: Open up quota-related queries [puppet] - 10https://gerrit.wikimedia.org/r/352606 (https://phabricator.wikimedia.org/T164332) [15:54:39] (03PS2) 10Andrew Bogott: Nova policy: Open up quota-related queries [puppet] - 10https://gerrit.wikimedia.org/r/352606 (https://phabricator.wikimedia.org/T164332) [15:55:50] (03CR) 10EBernhardson: [C: 031] Upgrade plugins for elastic 5.3.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) (owner: 10DCausse) [15:57:31] (03PS1) 10Marostegui: db-codfw.php: Repool db2038 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352607 
(https://phabricator.wikimedia.org/T163548) [15:59:01] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:00:30] (03PS1) 10Gehel: ELK - logstash package does not follow the same version naming [puppet] - 10https://gerrit.wikimedia.org/r/352608 (https://phabricator.wikimedia.org/T161908) [16:00:41] (03CR) 10Andrew Bogott: [C: 032] "The hiera keys will have to be done per-instance for now. Probably we'll eventually have to have two different hiera keys though :(" [puppet] - 10https://gerrit.wikimedia.org/r/351914 (owner: 10Andrew Bogott) [16:00:46] (03PS8) 10Andrew Bogott: Labs/salt: Use sha256 salt fingerprint with newer salt versions [puppet] - 10https://gerrit.wikimedia.org/r/351914 [16:00:51] (03PS1) 10Giuseppe Lavagetto: Fix double addition of url_prefix if it's defined in base_path as well. [software/service-checker] - 10https://gerrit.wikimedia.org/r/352609 [16:01:22] !log ganeti200[7-8] - signing puppet certs, salt-key, initial run [16:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:40] 06Operations, 10ops-codfw: audit all codfw pdu tower draws - https://phabricator.wikimedia.org/T163362#3244423 (10RobH) There also may be a idrac/ipmi command to query how the power supply units are drawing in the systems. Need to check. [16:06:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "One minor nitpick, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/352092 (owner: 10Ayounsi) [16:09:28] (03PS2) 10Ayounsi: Librensm syslog to listen on v6 IPs as well as v4 [puppet] - 10https://gerrit.wikimedia.org/r/352092 [16:10:51] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix double addition of url_prefix if it's defined in base_path as well. 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/352609 (owner: 10Giuseppe Lavagetto) [16:10:53] (03CR) 10Ayounsi: [C: 032] Librensm syslog to listen on v6 IPs as well as v4 [puppet] - 10https://gerrit.wikimedia.org/r/352092 (owner: 10Ayounsi) [16:13:42] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3244464 (10MoritzMuehlenhoff) p:05Triage>03Normal [16:15:51] (03PS3) 10Ayounsi: LibreNMS syslog to listen on v6 IPs as well as v4 [puppet] - 10https://gerrit.wikimedia.org/r/352092 [16:18:52] PROBLEM - puppet last run on ganeti2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[sysfsutils] [16:23:40] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3244524 (10Papaul) [16:23:49] (03CR) 10Elukey: "Any comments? :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351854 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [16:23:52] RECOVERY - puppet last run on ganeti2008 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:24:25] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3218244 (10Papaul) a:05Papaul>03akosiaris @akosiaris This is complete on my side you can take over. 
[16:27:29] <_joe_> !log installing the new service-checker on restbase2001,scb2001 [16:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:55] <_joe_> mobrovac: ^^ that has 1) parallel url fetching 2) can report timing data to statsd about endpoints it checks [16:28:15] _joe_: <3 [16:28:49] <_joe_> 2) can be interesting in order to keep track of regressions in performance on a few endpoints [16:30:07] <_joe_> check-mobileapps takes 1.3 s instead of 5-6s, too [16:33:19] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#3244604 (10GWicke) I added some hints, and linked to the upstream service repository. Functionally, the electron render service is all that is needed... [16:41:39] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3244646 (10madhuvishy) @Cmjohnson and I spoke a bit on IRC, and we talked about moving back to row C, and possibly to C2 and C3, pr... [16:46:13] 06Operations, 10Traffic, 10netops, 13Patch-For-Review: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3244656 (10BBlack) Updates from IRC-only work - a significant majority of our ICMP echo volume is coming from a large number of IPs owned by Google. TODO... 
[16:48:05] 06Operations, 10ops-eqiad, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3244672 (10Cmjohnson) [16:48:33] 06Operations, 10Trebuchet: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#3244676 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I believe the latest version of `git-fat` in T155856 fixed this, resolving [16:56:42] (03PS1) 10Smalyshev: Make fields config apply to test wikidata too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352617 [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1700). [17:00:40] _joe_: where's it fetching the urls from? [17:01:15] reading-web are also recording timing data for the summary endpoint from real clients [17:01:24] if we can supersede that, then great! 
[17:02:24] 06Operations, 10Traffic: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3244737 (10BBlack) [17:03:35] <_joe_> phuedx: not really, this is server-side [17:03:47] <_joe_> so the two things are complementary [17:04:00] <_joe_> also, this operates at a microservice level [17:04:39] neato [17:04:46] (03PS5) 10Dzahn: mariadb: grant user 'phstats' additional select on differential db [puppet] - 10https://gerrit.wikimedia.org/r/348565 [17:05:03] (03CR) 10Dzahn: [C: 031] mariadb: grant user 'phstats' additional select on differential db (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/348565 (owner: 10Dzahn) [17:05:53] (03PS2) 10Dzahn: Fix "Tasks closed" SQL query in monthly Phab metrics report email [puppet] - 10https://gerrit.wikimedia.org/r/352125 (https://phabricator.wikimedia.org/T164297) (owner: 10Aklapper) [17:07:08] (03CR) 10Dzahn: [C: 031] "oh @manuel i had just added you "fyi" about having tested the query and that it took 6 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/352125 (https://phabricator.wikimedia.org/T164297) (owner: 10Aklapper) [17:07:20] !log gehel@tin Started deploy [wdqs/wdqs@e637cf0]: (no justification provided) [17:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:31] 06Operations, 10ops-eqiad, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3051441 (10jcrespo) No problem with me on this, it just takes a few ( minutes to depool/shutdown mysql/shutdown, etc... [17:08:14] 06Operations, 10ops-eqiad, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3244795 (10jcrespo) CC @marostegui, in case I was away on vacation. 
[17:08:56] !log gehel@tin Finished deploy [wdqs/wdqs@e637cf0]: (no justification provided) (duration: 01m 36s) [17:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:35] SMalyshev: WDQS deployment completed, tests are green [17:09:58] (03PS2) 10Gehel: ELK - logstash package does not follow the same version naming [puppet] - 10https://gerrit.wikimedia.org/r/352608 (https://phabricator.wikimedia.org/T161908) [17:13:52] (03CR) 10Dzahn: [C: 032] Fix "Tasks closed" SQL query in monthly Phab metrics report email [puppet] - 10https://gerrit.wikimedia.org/r/352125 (https://phabricator.wikimedia.org/T164297) (owner: 10Aklapper) [17:18:26] (03PS2) 10Smalyshev: Make fields config apply to test wikidata too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352617 (https://phabricator.wikimedia.org/T164614) [17:20:48] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3244855 (10Gilles) Indeed, sounds like an easy win [17:28:58] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3244868 (10Papaul) @Volans Just want to mention that this out put is little bit misleading because the bad disk is not on heze but on heze-array1. [17:30:23] papaul: I'll look at it, thanks [17:33:10] papaul: can you give me some more details on it? 
[17:34:02] s/can/could/, like from where to get the heze-array1 ;) [17:35:32] (03PS1) 10Urbanecm: Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) [17:35:42] RECOVERY - HP RAID on ms-be1028 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:37:05] volans: sure [17:37:15] heze is connected to an array [17:37:31] the array has 12 disks of 4TB [17:37:44] (03PS3) 10Dzahn: webperf: Decom asset-check service [puppet] - 10https://gerrit.wikimedia.org/r/352302 (https://phabricator.wikimedia.org/T164419) (owner: 10Krinkle) [17:38:26] yes but how can I get the fact that the disk is on the array named heze-array1? [17:38:50] from megacli I get only Adapter #1, Virtual Drive 0, Span 0 (with 12 disks), as is in the task [17:39:19] (03CR) 10Dzahn: [C: 032] webperf: Decom asset-check service [puppet] - 10https://gerrit.wikimedia.org/r/352302 (https://phabricator.wikimedia.org/T164419) (owner: 10Krinkle) [17:40:33] (03CR) 10Dzahn: [C: 031] Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm) [17:40:58] (03CR) 10Dzahn: "Info: Computing checksum on file /srv/webperf/asset-check.js" [puppet] - 10https://gerrit.wikimedia.org/r/352302 (https://phabricator.wikimedia.org/T164419) (owner: 10Krinkle) [17:42:31] (03CR) 10Dzahn: "yea, that's what i meant. Reedy just found better words for it." 
[puppet] - 10https://gerrit.wikimedia.org/r/352278 (owner: 10Paladox) [17:42:40] (03PS5) 10Dzahn: base::standard_packages: Remove ubuntu precise check [puppet] - 10https://gerrit.wikimedia.org/r/352278 (owner: 10Paladox) [17:44:23] (03CR) 10Dzahn: [C: 032] base::standard_packages: Remove ubuntu precise check [puppet] - 10https://gerrit.wikimedia.org/r/352278 (owner: 10Paladox) [17:45:02] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:11] mutante thanks ^^ [17:47:47] volans: i guess megacli is just reading the array as part of heze so now it is up to the onsite tech to know a bad disk is from the server or from the array [17:49:02] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0] [17:49:42] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [17:49:57] (03PS2) 10Dzahn: keyholder: fix provisioning on trusty [puppet] - 10https://gerrit.wikimedia.org/r/351571 (owner: 10BryanDavis) [17:50:38] (03CR) 10Dzahn: [C: 032] "yep, per "Most of the other uses of /etc/tmpfiles.d in our Puppet tree are already guarded with a jessie/systemd check."" [puppet] - 10https://gerrit.wikimedia.org/r/351571 (owner: 10BryanDavis) [17:50:42] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0] [17:50:43] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1800.0] [17:50:43] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 37.93% of data above the critical threshold [1800.0] [17:51:02] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0] [17:56:56] all of the wdqs lag [17:56:59] (03PS1) 10DatGuy: Enable mapframe on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352622 (https://phabricator.wikimedia.org/T164574) 
[17:57:36] jouncebot: refresh [17:57:38] I refreshed my knowledge about deployments. [17:57:55] damn that's a stacked morning swat [17:57:59] Who's SWATing, and can I add more? [17:58:04] (03CR) 10Deskana: [C: 04-1] "Please wait and ask the maintainers of this service for input (see Phab task)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352622 (https://phabricator.wikimedia.org/T164574) (owner: 10DatGuy) [17:58:28] * abian is here for the Morning SWAT :) [17:58:50] greg-g: FYI I added a deploy window at 19:00Z today for Striker. [17:59:37] RainbowSprinkles: ^ that Striker window will roll out the fix for the last SMW blocker on wikitech :) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1800). Please do the needful. [18:00:04] kaldari, abian, RoanKattouw, phuedx, Gilles, and SMalyshev: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:14] o/ [18:00:45] I can do the SWAT today, if people are here [18:00:51] Looks like abian and James_F checked in already [18:01:18] RoanKattouw: It's currently 10 patches (5 branch, 5 config)… [18:01:25] Fine by me [18:01:28] o/ [18:01:29] Kk. [18:01:52] bd808: sweet [18:02:11] \o/ [18:02:38] (03CR) 10Paladox: [C: 031] "This should not be done until 2.14 due to 2.13 exposing passwords in non hashed formats." [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [18:03:04] phuedx: I think you may have typoed [1.29.0-wmf.21] 351382 Track and discard duplicate enqueued events ? See https://gerrit.wikimedia.org/r/#/c/351382/ [18:03:27] < professional [18:03:35] thanks RoanKattouw [18:03:54] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/352555/ and https://gerrit.wikimedia.org/r/#/c/352554/ are phuedx's. 
[18:04:03] ^ that [18:04:10] https://gerrit.wikimedia.org/r/#/q/status:open+branch:wmf/1.29.0-wmf.21 is my go-to. [18:04:14] Gotcha [18:04:16] will update the deployments page [18:04:20] Thanks [18:04:37] I'm here [18:04:52] OK, I've got all the wmf21 ones cooking in Jenkins, starting with config changes now [18:05:12] (03PS2) 10Catrope: Fix colors to match style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350697 (https://phabricator.wikimedia.org/T163048) (owner: 10Abián) [18:05:40] Thanks :) [18:07:13] (03PS3) 10Catrope: Fix colors to match style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350697 (https://phabricator.wikimedia.org/T163048) (owner: 10Abián) [18:07:18] (03CR) 10Catrope: [C: 032] Fix colors to match style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350697 (https://phabricator.wikimedia.org/T163048) (owner: 10Abián) [18:08:21] Apologies for the delay there, Jenkins is a bit overloaded (which is entirely my fault) [18:10:16] (03CR) 10Paladox: [C: 04-1] "Blocked until we update to 2.14." [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [18:10:49] papaul: sorry for the delay, was afk. Ok, let me know if you find a way to grab that info I'll update the script ;) [18:12:47] * phuedx makes a mug of horlicks [18:14:48] Hmm, Zend's prioritization isn't working optimally [18:14:59] It just started a test job even though there are outstanding gate-and-submit jobs [18:15:07] !log restarting wdqs-updater [18:15:08] s/Zend/Zuul [18:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:32] RoanKattouw: It's a per-job limit. [18:15:45] Aha [18:15:52] Well I did just blow up the gate-and-submit queue [18:16:23] Yes. [18:21:07] 06Operations, 10hardware-requests: Unmanaged switch for eqiad frack - https://phabricator.wikimedia.org/T164561#3245033 (10faidon) a:05faidon>03RobH That's fine, approved :) (shouldn't this be in #procurement?) 
[18:21:43] (03PS2) 10Dzahn: Remove the ipv6::relay/miredo manifests [puppet] - 10https://gerrit.wikimedia.org/r/350754 (owner: 10Faidon Liambotis) [18:21:55] 06Operations, 10netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (10ayounsi) [18:24:15] It's now finally started the last job for that first config patch [18:24:51] (03Merged) 10jenkins-bot: Fix colors to match style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350697 (https://phabricator.wikimedia.org/T163048) (owner: 10Abián) [18:24:59] Yay, finally [18:25:55] (03CR) 10jenkins-bot: Fix colors to match style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350697 (https://phabricator.wikimedia.org/T163048) (owner: 10Abián) [18:26:17] !log catrope@tin Synchronized static/images/project-logos/: T163048 (duration: 00m 39s) [18:26:23] (03PS3) 10Catrope: Make fields config apply to test wikidata too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352617 (https://phabricator.wikimedia.org/T164614) (owner: 10Smalyshev) [18:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:26] T163048: Showing wrong colors in the Wikidata logo - https://phabricator.wikimedia.org/T163048 [18:26:33] (03CR) 10Catrope: [C: 032] Make fields config apply to test wikidata too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352617 (https://phabricator.wikimedia.org/T164614) (owner: 10Smalyshev) [18:26:49] abian: Yours is done [18:27:05] SMalyshev: Yours will be on mwdebug1002 in a minute [18:27:08] (03CR) 10Dzahn: [C: 032] Remove the ipv6::relay/miredo manifests [puppet] - 10https://gerrit.wikimedia.org/r/350754 (owner: 10Faidon Liambotis) [18:27:25] RoanKattouw: I'd need it on terbium, it's only testable from command line [18:27:37] OK, will put it there once it merges [18:27:43] cool, thanks [18:28:17] (03Merged) 10jenkins-bot: Make fields config apply to test wikidata too [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/352617 (https://phabricator.wikimedia.org/T164614) (owner: 10Smalyshev) [18:28:25] (03CR) 10jenkins-bot: Make fields config apply to test wikidata too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352617 (https://phabricator.wikimedia.org/T164614) (owner: 10Smalyshev) [18:28:42] SMalyshev: On terbium [18:28:52] checking [18:29:12] RoanKattouw: seems to be working fine! [18:29:41] Thanks [18:30:11] kaldari: You ready for the cookie blocking patch? [18:30:17] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: T164614 (duration: 00m 40s) [18:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:25] T164614: Field limit error when reindexing testwikidatawiki - https://phabricator.wikimedia.org/T164614 [18:30:59] 06Operations, 10fundraising-tech-ops, 10netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245083 (10Jgreen) [18:31:07] RoanKattouw: Yes [18:31:12] James_F, phuedx, gilles: Your changes are on mwdebug1002, please test [18:31:18] (03PS3) 10Catrope: Enable cookie blocking on all remaining production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350090 (https://phabricator.wikimedia.org/T162651) (owner: 10Kaldari) [18:31:31] (03CR) 10Catrope: [C: 032] Enable cookie blocking on all remaining production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350090 (https://phabricator.wikimedia.org/T162651) (owner: 10Kaldari) [18:31:44] RoanKattouw: I still can't see wikidatawiki.png updated, only wikidata.png [18:31:52] Perhaps an issue with my cache? [18:31:59] Maybe? I did sync the entire directory [18:32:00] RoanKattouw: mine works as expected [18:32:14] RoanKattouw: thank you [18:32:43] (03PS2) 10Faidon Liambotis: Fix typo in stretch config [puppet] - 10https://gerrit.wikimedia.org/r/351640 (owner: 10Muehlenhoff) [18:32:44] RoanKattouw: Both of mine LGTM. 
[18:32:44] Both should be the same file, SHA-1: 0x5b78a8430ce339b4d606062cb04e11c59e9f046c [18:32:51] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Fix typo in stretch config [puppet] - 10https://gerrit.wikimedia.org/r/351640 (owner: 10Muehlenhoff) [18:33:00] !log maxsem@tin Started deploy [tilerator/deploy@001811e]: 001811e1a3eb21cf9246c5425f27f001b91efd27, was in testing for 3 weeks [18:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:18] !log maxsem@tin Finished deploy [tilerator/deploy@001811e]: 001811e1a3eb21cf9246c5425f27f001b91efd27, was in testing for 3 weeks (duration: 00m 20s) [18:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:50] !log catrope@tin Synchronized php-1.29.0-wmf.21/includes: T100999 (duration: 01m 24s) [18:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:58] T100999: Make the logo's loading priority higher - https://phabricator.wikimedia.org/T100999 [18:34:11] (03Merged) 10jenkins-bot: Enable cookie blocking on all remaining production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350090 (https://phabricator.wikimedia.org/T162651) (owner: 10Kaldari) [18:34:19] abian: Hah I'm seeing the reverse [18:34:24] curl 'https://en.wikipedia.org/w/static/images/project-logos/wikidatawiki.png' |sha1sum [18:34:28] 5b78a8430ce339b4d606062cb04e11c59e9f046c [18:34:33] curl 'https://en.wikipedia.org/w/static/images/project-logos/wikidata.png' |sha1sum [18:34:37] 19ccf1090024056f124915916512ed8a154a9964 [18:34:46] xD [18:34:47] Hopefully it'll fix itself over time [18:34:51] Great [18:35:07] Thanks, RoanKattouw :) [18:35:42] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:35:42] PROBLEM - Webrequests Varnishkafka log producer on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
(03CR) 10jenkins-bot: Enable cookie blocking on all remaining production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350090 (https://phabricator.wikimedia.org/T162651) (owner: 10Kaldari) [18:35:44] PROBLEM - MD RAID on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:35:45] ACKNOWLEDGEMENT - MD RAID on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T164779 [18:35:48] 06Operations, 10ops-ulsfo: Degraded RAID on cp4006 - https://phabricator.wikimedia.org/T164779#3245094 (10ops-monitoring-bot) [18:35:57] RoanKattouw: lgtm, tested popups/page previews anon and logged in and no regressions/errors in the console [18:36:06] Thanks [18:36:10] James's going out now, Sam's next [18:36:12] PROBLEM - Nginx local proxy to apache on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:32] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:39] !log catrope@tin Synchronized php-1.29.0-wmf.21/extensions/VisualEditor/modules/ve-mw/dm/metaitems/ve.dm.MWFlaggedMetaItem.js: T164054 (duration: 00m 38s) [18:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:48] T164054: [Regression] English magic words replaced with opposite localized ones while editing in VE - https://phabricator.wikimedia.org/T164054 [18:36:50] (03PS1) 10Dzahn: librenms: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352631 [18:36:52] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:12] PROBLEM - Freshness of zerofetch successful run file on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:37:32] PROBLEM - Confd vcl based reload on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[18:37:32] PROBLEM - Disk space on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:37:32] PROBLEM - salt-minion processes on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:37:32] PROBLEM - dhclient process on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:37:52] PROBLEM - Varnish traffic logger - varnishstatsd on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:38:12] RECOVERY - Freshness of zerofetch successful run file on cp4006 is OK: OK [18:38:17] !log catrope@tin Synchronized php-1.29.0-wmf.21/extensions/VisualEditor/extension.json: T164472 (duration: 00m 39s) [18:38:22] PROBLEM - Varnish traffic logger - varnishxcps on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:26] T164472: Label not displayed while VEditing on mobile site - https://phabricator.wikimedia.org/T164472 [18:38:35] (03PS1) 10Dzahn: ocg: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352633 [18:39:10] cp4006 is probably the oom issue again [18:39:20] 06Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3245106 (10DarTar) [18:39:22] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:39:22] PROBLEM - traffic-pool service on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:39:23] PROBLEM - configured eth on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:39:23] PROBLEM - confd service on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:39:23] PROBLEM - Freshness of OCSP Stapling files on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[18:39:23] PROBLEM - Check systemd state on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:39:23] PROBLEM - Varnish traffic logger - varnishreqstats on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:39:32] RECOVERY - dhclient process on cp4006 is OK: PROCS OK: 0 processes with command name dhclient [18:39:43] PROBLEM - IPsec on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:39:48] 06Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10DarTar) We removed the #rd tag and will follow up if there's any additional approval needed. [18:39:54] yup [18:40:02] PROBLEM - DPKG on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:40:11] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4006.ulsfo.wmnet [18:40:12] PROBLEM - Varnish traffic logger - varnishmedia on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:24] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4006.ulsfo.wmnet [18:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:03] RECOVERY - High lag on wdqs2003 is OK: OK: Less than 30.00% above the threshold [600.0] [18:41:03] PROBLEM - Varnish HTCP daemon on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:41:03] PROBLEM - Varnish traffic logger - varnishxcache on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[18:41:03] PROBLEM - SSH on cp4006 is CRITICAL: Server answer [18:41:03] (03PS2) 10Catrope: Enable Flow beta feature on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351891 (https://phabricator.wikimedia.org/T164498) [18:41:11] !log catrope@tin Synchronized php-1.29.0-wmf.21/extensions/Popups/: T163198 (duration: 00m 39s) [18:41:12] PROBLEM - Freshness of zerofetch successful run file on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:41:14] (03CR) 10Catrope: [C: 032] Enable Flow beta feature on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351891 (https://phabricator.wikimedia.org/T164498) (owner: 10Catrope) [18:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:19] T163198: Track instances of duplicate Popups events being logged - https://phabricator.wikimedia.org/T163198 [18:41:22] RECOVERY - Varnish traffic logger - varnishxcps on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcps, UID = 0 (root) [18:41:22] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp4006 is OK: No errors detected [18:41:22] RECOVERY - configured eth on cp4006 is OK: OK - interfaces up [18:41:22] RECOVERY - traffic-pool service on cp4006 is OK: OK - traffic-pool is active [18:41:23] RECOVERY - confd service on cp4006 is OK: OK - confd is active [18:41:23] RECOVERY - Check systemd state on cp4006 is OK: OK - running: The system is fully operational [18:41:23] RECOVERY - Varnish traffic logger - varnishreqstats on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishreqstats, UID = 0 (root) [18:41:24] RECOVERY - Freshness of OCSP Stapling files on cp4006 is OK: OK [18:41:25] (03PS1) 10Dzahn: elasticsearch: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352635 [18:41:32] RECOVERY - Confd vcl based reload on cp4006 is OK: reload-vcl successfully ran 2h, 25 minutes ago. 
[18:41:32] RECOVERY - Disk space on cp4006 is OK: DISK OK [18:41:32] RECOVERY - salt-minion processes on cp4006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:41:42] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp4006 is OK: No errors detected [18:41:42] RECOVERY - Webrequests Varnishkafka log producer on cp4006 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [18:41:44] RECOVERY - MD RAID on cp4006 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [18:41:44] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [18:41:52] RECOVERY - Varnish traffic logger - varnishstatsd on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) [18:42:02] RECOVERY - DPKG on cp4006 is OK: All packages OK [18:42:02] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] [18:42:03] RECOVERY - Varnish traffic logger - varnishxcache on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcache, UID = 0 (root) [18:42:03] RECOVERY - Varnish HTCP daemon on cp4006 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd [18:42:03] RECOVERY - SSH on cp4006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [18:42:12] RECOVERY - Freshness of zerofetch successful run file on cp4006 is OK: OK [18:42:12] RECOVERY - Varnish traffic logger - varnishmedia on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [18:42:33] (03Merged) 10jenkins-bot: Enable Flow beta feature on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351891 (https://phabricator.wikimedia.org/T164498) (owner: 10Catrope) [18:42:42] RECOVERY - High lag on wdqs2002 is OK: OK: Less than 30.00% above the threshold [600.0] [18:42:43] RECOVERY - High lag on wdqs2001 is OK: OK: Less than 30.00% above the threshold [600.0] [18:42:43] RECOVERY - High lag on wdqs1001 is OK: 
OK: Less than 30.00% above the threshold [600.0] [18:42:46] (03CR) 10jenkins-bot: Enable Flow beta feature on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351891 (https://phabricator.wikimedia.org/T164498) (owner: 10Catrope) [18:43:43] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [18:44:06] !log running varnish frontend restarts to fix memory sizing on 96GB and 192GB hosts over the next ~45m (mostly maps+misc hosts) [18:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:51] (03PS2) 10Catrope: Tweak ORES thresholds for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352297 (https://phabricator.wikimedia.org/T164621) [18:45:09] (03PS1) 10Dzahn: openstack: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352636 [18:45:11] (03CR) 10Catrope: [C: 032] Tweak ORES thresholds for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352297 (https://phabricator.wikimedia.org/T164621) (owner: 10Catrope) [18:45:24] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: T164498 (duration: 00m 39s) [18:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:31] T164498: Enable Flow on ca Wikiquote as Beta Feature - https://phabricator.wikimedia.org/T164498 [18:47:11] (03Merged) 10jenkins-bot: Tweak ORES thresholds for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352297 (https://phabricator.wikimedia.org/T164621) (owner: 10Catrope) [18:47:20] (03CR) 10jenkins-bot: Tweak ORES thresholds for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352297 (https://phabricator.wikimedia.org/T164621) (owner: 10Catrope) [18:49:45] (03PS1) 10Dzahn: etcd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352637 [18:49:54] !log cp4006 repooled (frontend restarted) [18:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:50:22] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 87 failures. Last run 55 seconds ago with 87 failures. Failed resources (up to 3 shown): Package[python3],Package[ack-grep],Package[python-etcd],Package[screen] [18:50:38] !log running varnish frontend restarts to fix memory sizing on 256G+ hosts over the next ~4.5 h (mostly text+upload hosts) [18:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:12] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:52:54] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: T164621 (duration: 00m 39s) [18:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:02] T164621: Adjust ORES levels on en.wiki to get better overlap between good faith and damage - https://phabricator.wikimedia.org/T164621 [18:54:20] (03PS1) 10Dzahn: mediawiki::jobrunner: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352638 [18:56:23] (03PS1) 10Dzahn: graphite: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352639 [18:57:20] (03CR) 10jerkins-bot: [V: 04-1] graphite: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352639 (owner: 10Dzahn) [18:58:14] (03PS1) 10Dzahn: base::puppet: use logrotate:;conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352640 [18:58:30] !log Restarted tilerator and tileratorui across the cluster [18:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:30] (03PS2) 10Dzahn: graphite: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352639 [19:00:04] bd808: Dear anthropoid, the time has come. Please deploy Striker (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1900). 
[19:00:09] o/ [19:00:13] (03PS3) 10Gehel: ELK - logstash package does not follow the same version naming [puppet] - 10https://gerrit.wikimedia.org/r/352608 (https://phabricator.wikimedia.org/T161908) [19:00:31] RoanKattouw: are you all done swatting? [19:00:31] 06Operations, 06Operations-Software-Development: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (10faidon) [19:00:53] 06Operations, 06Operations-Software-Development: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245209 (10faidon) [19:00:57] (03PS2) 10Dzahn: base::puppet: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352640 [19:03:29] (03CR) 10Gehel: [C: 032] ELK - logstash package does not follow the same version naming [puppet] - 10https://gerrit.wikimedia.org/r/352608 (https://phabricator.wikimedia.org/T161908) (owner: 10Gehel) [19:03:41] (03PS1) 10Dzahn: services: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352641 [19:04:22] 06Operations, 06Operations-Software-Development: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (10jcrespo) [] retire/and or substitute custom salt grains from puppet (?) [19:04:57] (03CR) 10jerkins-bot: [V: 04-1] services: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352641 (owner: 10Dzahn) [19:07:14] !log Applied database migration for T162508 to striker database on m5-master [19:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:22] T162508: Implement Tool Labs membership application and processing in Striker - https://phabricator.wikimedia.org/T162508 [19:07:22] 06Operations, 06Operations-Software-Development: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245221 (10faidon) [19:08:07] bd808: Yes, sorry [19:08:18] 06Operations, 06Labs: Stretch vs. 
Salt - https://phabricator.wikimedia.org/T164595#3245223 (10faidon) We should stop relying on (ancient versions of) SaltStack altogether, for security reasons among other things. I personally prefer the risk of wildly different salt versions compared to wildly outdated salt ve... [19:08:29] (03CR) 10Dzahn: [C: 031] logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [19:08:34] RoanKattouw: thanks. I figured you were when you didn't respond :) [19:12:07] (03CR) 10Gehel: [C: 032] Upgrade plugins for elastic 5.3.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) (owner: 10DCausse) [19:12:17] (03CR) 10Gehel: [V: 032 C: 032] Upgrade plugins for elastic 5.3.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/351819 (https://phabricator.wikimedia.org/T160948) (owner: 10DCausse) [19:13:17] (03PS1) 10Dzahn: squid3: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352643 [19:13:48] (03PS2) 10Dzahn: squid3: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352643 [19:14:43] (03CR) 10jerkins-bot: [V: 04-1] squid3: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352643 (owner: 10Dzahn) [19:15:13] !log Forced puppet run on californium to provision new striker config settings [19:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:49] I'll start deploying logstash upgrade in a few minutes. We'll stop collecting logs during the elasticsearch upgrade, which should take only a few minutes.
[19:16:00] (03PS1) 10Andrew Bogott: Added dummy uwsgi creds [labs/private] - 10https://gerrit.wikimedia.org/r/352644 [19:16:17] (03PS2) 10Andrew Bogott: Added dummy striker creds [labs/private] - 10https://gerrit.wikimedia.org/r/352644 [19:16:20] (03PS1) 10Dzahn: puppetmaster: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352645 [19:16:37] bd808: any conflict for the logstash upgrade with your current deployment? [19:17:01] !log bd808@tin Started deploy [striker/deploy@3836477]: Implement Tool Labs membership application and processing (T162508) [19:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:09] T162508: Implement Tool Labs membership application and processing in Striker - https://phabricator.wikimedia.org/T162508 [19:17:31] gehel: nope [19:17:33] !log bd808@tin Finished deploy [striker/deploy@3836477]: Implement Tool Labs membership application and processing (T162508) (duration: 00m 32s) [19:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:41] bd808: ok, thanks! [19:17:53] (03PS2) 10Dzahn: services: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352641 [19:19:18] (03PS3) 10Dzahn: squid3: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352643 [19:28:02] !log starting ELK (logstash) upgrade - T161908 [19:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:15] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [19:34:13] !log restarted varnishxcache service on cp3031, was malfunctioning and sending crazy stats to grafana... 
[19:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:51] !log Deployment of Striker for T162508 complete; will continue debug keystone issue that is preventing Tool Labs membership requests from being approved [19:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:59] T162508: Implement Tool Labs membership application and processing in Striker - https://phabricator.wikimedia.org/T162508 [19:44:05] (03PS1) 10Gehel: logstash upgrade to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/352648 (https://phabricator.wikimedia.org/T161908) [19:46:10] (03CR) 10EBernhardson: [C: 031] logstash upgrade to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/352648 (https://phabricator.wikimedia.org/T161908) (owner: 10Gehel) [19:46:15] (03CR) 10Gehel: [C: 032] logstash upgrade to elasticsearch 5 [puppet] - 10https://gerrit.wikimedia.org/r/352648 (https://phabricator.wikimedia.org/T161908) (owner: 10Gehel) [19:47:21] !log logstash / elasticsearch downtime coming up - T161908 [19:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:30] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [19:47:59] (03PS1) 10Dzahn: camus: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352649 [19:51:05] (03PS1) 10Dzahn: confd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352652 [19:56:58] (03PS1) 10Dzahn: base::puppet: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352654 [19:58:11] (03PS1) 10Gehel: elasticsearch - do not define gelf appender if it is unused [puppet] - 10https://gerrit.wikimedia.org/r/352655 [19:59:12] (03PS1) 10Dzahn: rsyslog: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352656 [19:59:19] (03CR) 10EBernhardson: [C: 031] elasticsearch - do not define gelf appender if it is unused [puppet] - 
10https://gerrit.wikimedia.org/r/352655 (owner: 10Gehel) [19:59:29] (03CR) 10Gehel: [C: 032] elasticsearch - do not define gelf appender if it is unused [puppet] - 10https://gerrit.wikimedia.org/r/352655 (owner: 10Gehel) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T2000). Please do the needful. [20:00:49] (03PS1) 10Dzahn: snapshot: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352657 [20:02:17] (03PS1) 10Dzahn: salt: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352658 [20:02:40] arlo will be doing a parsoid deploy in a little bit once he has verified a few things. [20:02:50] !log restarting elasticsearch on logstash cluster after upgrade - T161908 [20:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:59] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [20:03:54] (03PS1) 10Dzahn: dynamicproxy: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352660 [20:05:40] (03PS1) 10Dzahn: profile::base: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352661 [20:14:44] (03PS1) 10Dzahn: ocg: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352664 [20:19:10] (03PS18) 10EBernhardson: Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [20:20:32] (03CR) 10Gehel: [C: 032] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson) [20:21:56] !log upgrading kibana on logstash cluster - T161908 [20:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:05] T161908: ELK 5.x deployment plan - 
https://phabricator.wikimedia.org/T161908 [20:24:32] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - kibana_80 - Could not depool server logstash1001.eqiad.wmnet because of too many down! [20:24:32] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - kibana_80 - Could not depool server logstash1001.eqiad.wmnet because of too many down! [20:24:32] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - kibana_80 - Could not depool server logstash1001.eqiad.wmnet because of too many down! [20:25:08] PROBLEM - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [20:25:22] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - kibana_80 - Could not depool server logstash1001.eqiad.wmnet because of too many down! [20:26:10] ^ that's the kibana upgrade taking more time than expected [20:26:17] should be good in a few seconds [20:26:32] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [20:27:08] RECOVERY - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 3248 bytes in 0.079 second response time [20:27:20] :) [20:27:22] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [20:27:32] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [20:27:32] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [20:27:51] !log restarted kibana on logstash cluster - T161908 [20:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:59] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [20:30:03] !log arlolra@tin Started deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 [20:30:10] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:43] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#3245555 (10TheDJ) @GWicke Thank You ! [20:34:54] !log arlolra@tin Finished deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 (duration: 04m 50s) [20:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:38] (03PS1) 10Gehel: kibana - cleanup of /opt/kibana has been done [puppet] - 10https://gerrit.wikimedia.org/r/352666 (https://phabricator.wikimedia.org/T161908) [20:37:09] (03CR) 10EBernhardson: [C: 031] kibana - cleanup of /opt/kibana has been done [puppet] - 10https://gerrit.wikimedia.org/r/352666 (https://phabricator.wikimedia.org/T161908) (owner: 10Gehel) [20:37:42] !log silencing elasticsearch shard icinga check, recovery after upgrade is going to take a long time - T161908 [20:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:50] T161908: ELK 5.x deployment plan - https://phabricator.wikimedia.org/T161908 [20:44:40] (03PS3) 10BryanDavis: Added dummy striker creds [labs/private] - 10https://gerrit.wikimedia.org/r/352644 (owner: 10Andrew Bogott) [20:44:49] (03CR) 10BryanDavis: [C: 031] Added dummy striker creds [labs/private] - 10https://gerrit.wikimedia.org/r/352644 (owner: 10Andrew Bogott) [20:47:11] !log arlolra@tin Started deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 [20:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:48] !log arlolra@tin Finished deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 (duration: 01m 36s) [20:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:40] (03PS3) 10Dzahn: nagios_common: test contact template with private/hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/352269 [20:53:31] (03CR) 10Dzahn: [C: 032] "just
testing if direct lookup from private hiera fails only in labs or both prod and labs to help with debug. will most likely revert it r" [puppet] - 10https://gerrit.wikimedia.org/r/352269 (owner: 10Dzahn) [20:59:42] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T2100). Please do the needful. [21:00:44] tegmen: known, me. will fix that in a few [21:00:54] just testing why that doesnt work [21:02:33] !log arlolra@tin Started deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 [21:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:16] !log arlolra@tin Finished deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 (duration: 02m 43s) [21:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:24] !log mobrovac@tin Started deploy [restbase/deploy@c70a1e1]: Remove the mobile-text end point - T158128 [21:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:31] T158128: Deprecate and delete mobile-text endpoint - https://phabricator.wikimedia.org/T158128 [21:06:37] (03PS4) 10Paladox: Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 [21:12:47] !log mobrovac@tin Finished deploy [restbase/deploy@c70a1e1]: Remove the mobile-text end point - T158128 (duration: 06m 23s) [21:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:56] T158128: Deprecate and delete mobile-text endpoint - https://phabricator.wikimedia.org/T158128 [21:17:12] !log mobrovac@tin Started deploy [graphoid/deploy@a288409]: Switched to npm-stored graph-shared, fix mapsnapshot - T164046 [21:17:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:20] T164046: graph's mapsnapshot protocol incorrectly validates map style parameter - https://phabricator.wikimedia.org/T164046 [21:17:50] !log mobrovac@tin Finished deploy [graphoid/deploy@a288409]: Switched to npm-stored graph-shared, fix mapsnapshot - T164046 (duration: 00m 38s) [21:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:18] !log mobrovac@tin Started deploy [graphoid/deploy@a288409]: Switched to npm-stored graph-shared, fix mapsnapshot - T164046 [21:18:22] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:09] !log mobrovac@tin Finished deploy [graphoid/deploy@a288409]: Switched to npm-stored graph-shared, fix mapsnapshot - T164046 (duration: 03m 51s) [21:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:49] (03PS3) 10Paladox: Test: DO NOT MERGE [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [21:24:04] ignore that ^^ [21:24:11] !log mobrovac@tin Started deploy [graphoid/deploy@a288409]: Switched to npm-stored graph-shared, fix mapsnapshot - T164046 [21:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:19] T164046: graph's mapsnapshot protocol incorrectly validates map style parameter - https://phabricator.wikimedia.org/T164046 [21:25:51] !log mobrovac@tin Finished deploy [graphoid/deploy@a288409]: Switched to npm-stored graph-shared, fix mapsnapshot - T164046 (duration: 01m 39s) [21:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:32] PROBLEM - Check systemd state on cp3035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:32:42] PROBLEM - MariaDB Slave SQL: s3 on db2057 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table testwiki.echo_notification doesnt exist on query. Default database: testwiki. [Query snipped] [21:34:55] (03Draft1) 10Paladox: Gerrit: Remove "" around T\\d+ in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/352710 [21:34:58] (03Draft2) 10Paladox: Gerrit: Remove "" around T\\d+ in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/352710 [21:38:32] RECOVERY - Check systemd state on cp3035 is OK: OK - running: The system is fully operational [21:40:52] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.97 seconds [21:48:32] PROBLEM - Confd vcl based reload on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:32] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:32] PROBLEM - configured eth on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:32] PROBLEM - Disk space on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:33] PROBLEM - Freshness of OCSP Stapling files on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:33] PROBLEM - Varnish traffic logger - varnishstatsd on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:33] PROBLEM - Check systemd state on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:34] PROBLEM - dhclient process on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:34] PROBLEM - Freshness of zerofetch successful run file on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:48:35] PROBLEM - Varnish traffic logger - varnishxcps on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:35] PROBLEM - Webrequests Varnishkafka log producer on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:42] PROBLEM - Varnish traffic logger - varnishreqstats on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:42] PROBLEM - Varnish HTCP daemon on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:42] PROBLEM - salt-minion processes on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:48:42] PROBLEM - SSH on cp3035 is CRITICAL: Server answer [21:48:42] PROBLEM - Varnish traffic logger - varnishxcache on cp3035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:49:32] RECOVERY - Confd vcl based reload on cp3035 is OK: reload-vcl successfully ran 1h, 20 minutes ago. [21:49:32] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp3035 is OK: No errors detected [21:49:32] RECOVERY - configured eth on cp3035 is OK: OK - interfaces up [21:49:32] RECOVERY - Disk space on cp3035 is OK: DISK OK [21:49:32] RECOVERY - Freshness of OCSP Stapling files on cp3035 is OK: OK [21:49:33] RECOVERY - dhclient process on cp3035 is OK: PROCS OK: 0 processes with command name dhclient [21:49:33] RECOVERY - Varnish traffic logger - varnishstatsd on cp3035 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) [21:49:33] RECOVERY - Varnish traffic logger - varnishxcps on cp3035 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcps, UID = 0 (root) [21:49:34] RECOVERY - Freshness of zerofetch successful run file on cp3035 is OK: OK [21:49:35] RECOVERY - Webrequests Varnishkafka log producer on cp3035 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [21:49:42] RECOVERY - Varnish traffic logger - varnishreqstats on 
cp3035 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishreqstats, UID = 0 (root) [21:49:42] RECOVERY - Varnish HTCP daemon on cp3035 is OK: PROCS OK: 1 process with UID = 115 (vhtcpd), args vhtcpd [21:49:42] RECOVERY - salt-minion processes on cp3035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:49:42] RECOVERY - SSH on cp3035 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [21:49:42] RECOVERY - Varnish traffic logger - varnishxcache on cp3035 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcache, UID = 0 (root) [21:54:35] (03PS4) 10Madhuvishy: gridengine: Follow up - delete old maintenance scripts and tracker/collector puppet code [puppet] - 10https://gerrit.wikimedia.org/r/352301 (https://phabricator.wikimedia.org/T162955) [21:55:08] !log depooled cp3035 (memory issues - already scheduled for FE restart to fix, which will repool when it's reached in the list...) [21:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:32] !log rebooting labservices1002 to mess with the bios [22:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:42] RECOVERY - Check systemd state on cp3035 is OK: OK - running: The system is fully operational [22:17:38] !log Deleted 2fa for user Mdann52 on wikitech after verifying account ownership via ssh file creation. T164804 [22:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:46] T164804: Remove 2FA on wikitech for user Mdann52 - https://phabricator.wikimedia.org/T164804 [22:20:22] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3245942 (10Volans) @Papaul thanks for letting me know. I understand the problem, given the particular nature of the `heze` host, although after a quick check I didn't see a way to get the //physical// location of the d...
[22:22:42] RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [22:24:09] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3245948 (10Papaul) @Volans not a big deal just wanted you to know. Thanks. [22:28:54] (03PS1) 10Dzahn: nagios_common: move icinga contacts to Hiera top-level (debug) [puppet] - 10https://gerrit.wikimedia.org/r/352717 [22:29:48] (03PS2) 10Dzahn: nagios_common: move new icinga contacts to Hiera top-level (debug) [puppet] - 10https://gerrit.wikimedia.org/r/352717 [22:30:09] (03CR) 10Dzahn: [C: 032] "just debugging, does _not_ influence actual Icinga contacts" [puppet] - 10https://gerrit.wikimedia.org/r/352717 (owner: 10Dzahn) [22:37:43] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:47:43] PROBLEM - Host labservices1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:48:23] RECOVERY - Host labservices1002 is UP: PING OK - Packet loss = 0%, RTA = 36.35 ms [22:52:39] jouncebot: now [22:52:39] For the next 0 hour(s) and 7 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T2100) [22:54:03] !log bd808@tin Started deploy [striker/deploy@00e8545]: openstack: Role modifications require global admin rights (T164787) [22:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:12] T164787: Keystone permissions exception when trying to approve Tool Labs membership request via Striker - https://phabricator.wikimedia.org/T164787 [22:54:20] dammit, usually an hour is enough downtime for a simple reboot [22:54:31] !log bd808@tin Finished deploy [striker/deploy@00e8545]: openstack: Role modifications require global admin rights (T164787) (duration: 00m 27s) [22:54:38] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T2300). [23:00:39] nothing to deploy? [23:01:59] looks like they crammed it all into morning swat today [23:02:17] (03CR) 10Volans: [C: 04-1] "Is there a task to discuss this? Code reviews are not the best place for more broader discussions and I have a bunch of questions about th" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [23:03:01] (03CR) 10Dzahn: "the task for this should be https://phabricator.wikimedia.org/T137928" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [23:03:24] (03CR) 10Dzahn: "well, or more specifically https://phabricator.wikimedia.org/T112776" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [23:03:37] (03PS3) 1020after4: phabricator: enable vcs and web user to run `git` and `ssh` via sudo [puppet] - 10https://gerrit.wikimedia.org/r/324841 (https://phabricator.wikimedia.org/T143175) [23:03:41] I'm sure we can *find* something to deploy! [23:03:58] (03CR) 1020after4: [C: 031] "https://phabricator.wikimedia.org/T143175 is the repository clustering" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (https://phabricator.wikimedia.org/T143175) (owner: 1020after4) [23:07:15] RainbowSprinkles: what do we need to undo so that I can delete SMW forms on wikitech and make some redirects in their place? [23:07:41] Need to undo? 
[23:08:07] you locked us out of that remember [23:08:11] Ohhhh, lol [23:08:12] Yes [23:08:17] There's a hack in wikitech.php [23:08:35] // Disable creation of new Forms on wikitech, we're moving away from this [23:08:35] $wgNamespaceProtection[106] = 'nomorenewforms'; [23:08:42] Just comment that bit out ^ [23:08:54] *nod* [23:10:43] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:11:15] RainbowSprinkles: I just deployed the code today that will let me turn off the last actual use of SMW on wikitech. We could be bold and disable SMW tomorrow. [23:11:33] and not put it in the next wmf branch [23:11:40] 07Puppet, 10Beta-Cluster-Infrastructure, 10Phabricator, 13Patch-For-Review: puppet failure on deployment-phab01: Service[ssh-phab] refuses to start - https://phabricator.wikimedia.org/T147818#3246059 (10mmodell) I think it's ok for ssh-phab to fail in beta. [23:11:57] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3246062 (10mmodell) [23:12:01] 07Puppet, 10Beta-Cluster-Infrastructure, 10Phabricator, 13Patch-For-Review: puppet failure on deployment-phab01: Service[ssh-phab] refuses to start - https://phabricator.wikimedia.org/T147818#3246060 (10mmodell) 05Open>03Resolved a:03mmodell [23:12:10] bd808: Have we given edit permissions back again so we can delete more stuff? :P [23:12:12] 07Puppet, 10Beta-Cluster-Infrastructure, 10Phabricator, 13Patch-For-Review: puppet failure on deployment-phab01: Service[ssh-phab] refuses to start - https://phabricator.wikimedia.org/T147818#2704482 (10mmodell) a:05mmodell>03hashar [23:12:26] Reedy: he just told me how :) [23:12:38] I was thinking of making the patch and swatting it [23:12:43] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 5.285 seconds response time.
www.wikipedia.org returns 208.80.154.224 [23:13:19] [00:13:10] (PS1) Reedy: Stop branching Semantic stuff for wikitech [tools/release] - https://gerrit.wikimedia.org/r/352720 [23:14:45] (03PS1) 10BryanDavis: Revert "Disable creation of new forms on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352721 (https://phabricator.wikimedia.org/T53642) [23:14:57] (03CR) 10jerkins-bot: [V: 04-1] Revert "Disable creation of new forms on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352721 (https://phabricator.wikimedia.org/T53642) (owner: 10BryanDavis) [23:16:47] (03PS2) 10BryanDavis: Revert "Disable creation of new forms on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352721 (https://phabricator.wikimedia.org/T53642) [23:18:31] (03CR) 10Andrew Bogott: [V: 032 C: 032] Added dummy striker creds [labs/private] - 10https://gerrit.wikimedia.org/r/352644 (owner: 10Andrew Bogott) [23:18:32] (03CR) 10BryanDavis: [C: 032] Revert "Disable creation of new forms on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352721 (https://phabricator.wikimedia.org/T53642) (owner: 10BryanDavis) [23:19:30] (03Merged) 10jenkins-bot: Revert "Disable creation of new forms on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352721 (https://phabricator.wikimedia.org/T53642) (owner: 10BryanDavis) [23:19:39] (03CR) 10jenkins-bot: Revert "Disable creation of new forms on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352721 (https://phabricator.wikimedia.org/T53642) (owner: 10BryanDavis) [23:21:22] (03PS1) 10Reedy: Undeploy Semantic* from wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352722 [23:21:44] !log bd808@tin Synchronized wmf-config/wikitech.php: Revert "Disable creation of new forms on wikitech" (T53642) (duration: 01m 10s) [23:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:53] T53642: Get rid of SemanticMediaWiki/SRF/SF from 
wikitech.wikimedia.org - https://phabricator.wikimedia.org/T53642 [23:22:12] (03PS2) 10Reedy: Undeploy Semantic* from wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352722 (https://phabricator.wikimedia.org/T53642) [23:23:30] DELETE ALL THE THINGS [23:24:29] \m/ O-O \m/ [23:26:28] waits for deployment of mw-unknown-version [23:29:03] no more SMW forms on wikitech [23:30:05] https://wikitech.wikimedia.org/w/index.php?title=Special:Properties&limit=500&offset=0 [23:30:10] Hmm... Quite a few still on forms... [23:30:14] Are we just gonna break them? :P [23:31:26] that's exactly what i was wondering when i saw somebody say "no more smw" [23:31:50] well... what do we need to do? [23:31:59] the data is in the pages right? [23:32:06] I thought I killed the Semantic stuff from the {{Project}} template [23:32:13] Reedy, https://gerrit.wikimedia.org/r/352724 [23:32:25] eg: https://wikitech.wikimedia.org/wiki/Property:Project_state [23:32:35] MaxSem: I already made one of those :P [23:32:56] Granted, I don't know how the hell this thing even works [23:33:14] * bd808 is making edits to point to https://toolsadmin.wikimedia.org/tools/membership/apply instead of the old form [23:33:26] https://wikitech.wikimedia.org/wiki/Property:Shell_Justification [23:33:35] I think it's just the way it picks out parameters and stuff... [23:33:36] We *could* just kill things piecemeal, instead of all at once [23:33:39] Starting with Forms would be nice [23:33:47] (rather than breaking the world, potentially) [23:34:11] this world is cruel and unfair. down with it! [23:34:12] Reedy: those are SMW properties gathered from the old requests [23:34:54] A load of the data in the semantic tables is orphaned and won't refresh etc [23:35:01] yeah [23:35:24] Oh yeah, refreshing those tables...
[23:35:28] I remember that bug [23:35:30] that template doesn't seem to have SMW markers in it any more -- https://wikitech.wikimedia.org/w/index.php?title=Template:Shell_Access_Request&action=edit [23:35:30] Modification date of type Date (11,963 uses) [23:36:20] Tools Request User Name of type Text (1,209 uses) [23:36:21] I just deleted Property:Project_state [23:36:24] the way that SMW works all the data is in the wiki pages. At worst we will need to edit a few more templates to take SMW cruft out of them [23:36:25] And the stuff using it is working fine [23:37:16] anybody know how to do a what links here for one of the forms I nuked :/ [23:37:39] Special:WhatLinksHere/Form:FormName? [23:37:42] bd808, SELECT * FROM ... [23:37:47] * bd808 can probably find things via sneaky cirrussearch backend magic [23:38:57] Reedy: it's via Special:FormEdit so .. yeah [23:39:57] https://wikitech.wikimedia.org/w/index.php?title=Template%3ATools_Access_Request&type=revision&diff=1758704&oldid=1457533 [23:40:01] Should remove some redlinks [23:41:36] Yup, so https://wikitech.wikimedia.org/wiki/Shell_Request/$traight-$hoota isn't looking stupid now [23:42:41] Reedy: that [23:42:53] Fixing a few more [23:42:56] ugh that "[[Tools Request User Name::" is a SMW thingy too right [23:42:57] The job queue is gonna be lol [23:43:14] Seems to be for the property lists [23:43:25] * bd808 is hacking a grep tool [23:44:56] (03Abandoned) 10XXN: Fixes for namespace definitions for some Romanian (ro) projects. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321121 (owner: 10XXN) [23:45:41] (03Draft2) 10XXN: Fixing "Book_talk" namespace definition for ro.wikipedia: *Discuţie_Book => Book_talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 [23:47:54] RainbowSprinkles: what's the trick to add to a Special:Search query to see the json dump of the schema? [23:49:35] Ummm. [23:49:40] cirrusDumpQuery?
[23:50:00] that will work [23:50:15] (03PS3) 10XXN: Fixing "Book_talk" namespace definition for ro.wikipedia: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352728 [23:52:52] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:55:19] ACKNOWLEDGEMENT - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn blame mutante [23:59:10] bd808: lol [23:59:17] the total edits on wikitech has dropped by about 400k [23:59:35] nice. [23:59:45] contentpages from 627,792 to 6,868 [23:59:55] I got my grep hack done. only 15 pages to edit it looks like