Wikimedia IRC logs browser - #wikimedia-operations

2016-04-19 00:01:28 <icinga-wm> RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
2016-04-19 00:05:52 <icinga-wm> PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
2016-04-19 00:09:21 <wikibugs> Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2216392 (Dzahn) created maint-announce@lists (for less confusion identical name but with .lists. ) https://lists.wikimedia.org/mailman/admin/maint-announce set archives to private added noc@ as admin, su...
2016-04-19 00:09:41 <wikibugs> Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2216393 (Dzahn) p:Low>Normal
2016-04-19 00:13:02 <wikibugs> Operations, MediaWiki-General-or-Unknown, Traffic, Wikimedia-General-or-Unknown, HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2216425 (Reedy)
2016-04-19 00:13:08 <grrrit-wm> (PS1) BBlack: secure WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/284110 (https://phabricator.wikimedia.org/T119576)
2016-04-19 00:13:10 <grrrit-wm> (PS1) BBlack: secure CP cookie [puppet] - https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576)
2016-04-19 00:22:55 <grrrit-wm> (CR) Reedy: [C: ] secure WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/284110 (https://phabricator.wikimedia.org/T119576) (owner: BBlack)
2016-04-19 00:23:32 <grrrit-wm> (CR) Reedy: [C: ] secure CP cookie [puppet] - https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576) (owner: BBlack)
2016-04-19 00:24:22 <wikibugs> Operations, MediaWiki-General-or-Unknown, Traffic, Wikimedia-General-or-Unknown, HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2216483 (Reedy)
2016-04-19 00:24:26 <wikibugs> Operations, Traffic, HTTPS, Patch-For-Review, Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2216482 (Reedy) Resolved>Open
2016-04-19 00:26:56 <grrrit-wm> (CR) BBlack: [C: 2] secure WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/284110 (https://phabricator.wikimedia.org/T119576) (owner: BBlack)
2016-04-19 00:29:57 <mutante> !log kraz.codfw.wmnet - initial install, adding to site
2016-04-19 00:30:01 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 00:31:59 <grrrit-wm> (PS1) BBlack: varnish redir: wmfusercontent.org -> www.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/284112 (https://phabricator.wikimedia.org/T132452)
2016-04-19 00:42:57 <mutante> !log kraz - signing puppet certs, adding salt keys
2016-04-19 00:43:01 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 00:44:39 <wikibugs> Operations, Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216505 (Dzahn) installed kraz.codfw.wmnet - added to puppet, salt, icinga, added mw-rc role
2016-04-19 00:47:30 <wikibugs> Operations, Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216508 (Dzahn) next we need: ::Ircserver/Service[ircd]: Provider upstart is not functional on this host
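The Puppet error above means the service definition still assumes upstart, which jessie does not ship; the ircd needs a native systemd unit instead ("then we need unit files", as mutante notes below). A purely hypothetical sketch of what such a unit might look like — the user, binary path, and flags are illustrative assumptions, not the actual ircd packaging:

```ini
# /etc/systemd/system/ircd.service (hypothetical sketch)
[Unit]
Description=IRCd for the irc.wikimedia.org RC feed
After=network.target

[Service]
# User and ExecStart path are illustrative assumptions.
User=irc
ExecStart=/usr/sbin/ircd -foreground
Restart=on-failure

[Install]
WantedBy=multi-user.target
```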
2016-04-19 00:52:27 <grrrit-wm> (PS2) Dzahn: interface: move rps::modparams to own file [puppet] - https://gerrit.wikimedia.org/r/284083
2016-04-19 00:56:54 <wikibugs> Operations, Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216516 (Dzahn) eh, and this needs a public IP, unlike antimony
2016-04-19 00:59:33 <wikibugs> Operations, Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216518 (Dzahn) a:Dzahn
2016-04-19 01:00:31 <grrrit-wm> (PS1) Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729)
2016-04-19 01:04:35 <grrrit-wm> (CR) Ori.livneh: "Krinkle: I agree. How do we do it, though? I am loathe to hard-code which wikis are in each group, so it either has to be available client" [puppet] - https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557) (owner: Ori.livneh)
2016-04-19 01:07:16 <grrrit-wm> (PS1) Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729)
2016-04-19 01:08:00 <icinga-wm> PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 3 failures
2016-04-19 01:09:35 <icinga-wm> ACKNOWLEDGEMENT - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn not fully installed yet
2016-04-19 01:11:01 <grrrit-wm> (CR) Alex Monk: kraz.codfw.wmnet -> kraz.wikimedia.org (1 comment) [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 01:11:41 <mutante> !log restbase2004 - unit cassandra-b is failed
2016-04-19 01:11:45 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 01:13:05 <grrrit-wm> (CR) Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org (1 comment) [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 01:13:11 <grrrit-wm> (PS2) Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729)
2016-04-19 01:13:22 <grrrit-wm> (CR) Krinkle: "argon, not antimony. Right?" [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 01:13:50 <grrrit-wm> (PS3) Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729)
2016-04-19 01:14:16 <grrrit-wm> (CR) Dzahn: "yes, it's clearly not a good idea that i work on both at the same time and time to take a break :)" [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 01:14:29 <Krinkle> mutante: So what's the migration like?
2016-04-19 01:14:47 <Krinkle> We can have MediaWiki emit to both for a while, and then shut down the old one so that clients automatically reconnect to the new one at that point.
2016-04-19 01:14:56 <mutante> i don't know
2016-04-19 01:15:18 <mutante> but that sounds good
2016-04-19 01:15:23 <Krinkle> mutante: It is possible for sessions to remain on argon while we change irc.wikimedia.org to the new one, right?
2016-04-19 01:15:32 <mutante> no idea
2016-04-19 01:16:07 <Krinkle> I mean, the ability for irc to communicate doesn't relate to the hostname still resolving to the same IP, right? IRC only resolves the host when creating the connection, not for each UDP packet.
2016-04-19 01:16:46 <Krinkle> once the DNS rollout is complete (12 hours? 24 hours?) and all new clients use the new one and that one is working, we can shut it down and clients will just reconnect. Just like a reboot basically.
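Krinkle's point about DNS can be illustrated: a TCP client resolves the hostname once when opening the connection, and the established socket then keeps talking to that IP, so repointing the DNS record later only affects new connections. A minimal sketch (the resolver call is standard library; the hostnames used are illustrative):

```python
import socket

def resolve_once(hostname, port=6667):
    """Resolve a hostname to a single IP address, as a client does
    once when opening a connection. The TCP session that follows is
    bound to this IP; changing the DNS record afterwards does not
    affect the already-established connection."""
    family, _, _, _, sockaddr = socket.getaddrinfo(
        hostname, port, type=socket.SOCK_STREAM)[0]
    return sockaddr[0]

# Existing sessions keep the IP captured at connect time, so
# repointing irc.wikimedia.org only affects clients connecting later.
print(resolve_once("localhost"))
```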
2016-04-19 01:17:00 <Krinkle> mutante: May wanna combine with the May 2nd deployment
2016-04-19 01:17:06 <Krinkle> which is when the next maintenance reboot is scheduled for irc
2016-04-19 01:17:10 <mutante> i don't know, i just installed a VM, nobody has talked about this
2016-04-19 01:17:25 <Krinkle> https://phabricator.wikimedia.org/T123729
2016-04-19 01:17:29 <mutante> tomorrow at least one will exist
2016-04-19 01:17:41 <mutante> then we need unit files
2016-04-19 01:17:44 <Krinkle> Well, as long as nobody is shutting down or upgrading argon or changing irc.wikimedia.org destination yet :)
2016-04-19 01:18:44 <ori> could even keep argon running until connections drop off naturally
2016-04-19 01:18:47 <mutante> i know that ticket but nothing about a scheduled reboot
2016-04-19 01:19:07 <Krinkle> mutante: https://phabricator.wikimedia.org/T122933
2016-04-19 01:19:19 <Krinkle> The next change (and restart) is May 2nd
2016-04-19 01:19:37 <mutante> aha
2016-04-19 01:19:38 <Krinkle> announced so that people are aware of it, since normally reboots must not happen unannounced on irc.wikimedia.org
2016-04-19 01:20:22 <Krinkle> ori: Yeah, if there's no rush to upgrade argon, then letting it drain naturally over a few days would be preferable.
2016-04-19 01:20:43 <mutante> there is a rush to upgrade
2016-04-19 01:20:55 <Krinkle> how strongly?
2016-04-19 01:21:47 <Krinkle> We announced the restart on May 2nd back in March (a 1.5-month heads-up).
2016-04-19 01:21:57 <Krinkle> So it seems unwise to restart or upgrade before that.
2016-04-19 01:22:01 <mutante> i cant quantify it but when trying to kill precise you never get to it, because there is always one small reason to not rush
2016-04-19 01:22:08 <mutante> until forever
2016-04-19 01:23:26 <mutante> the plan wasnt to upgrade or restart though, it was to start a new one and shutdown the old one
2016-04-19 01:23:35 <Krinkle> We can have the new VM set up and accepting connections now, then configure MW to feed it, switch DNS, and after the deployment on May 2nd argon can just go down (rather than restart)
2016-04-19 01:23:42 <mutante> as far as you can call it a plan
2016-04-19 01:23:55 <mutante> Krinkle: sounds good to me
2016-04-19 01:23:56 <Krinkle> mutante: Sure, but transition or restart, either way interrupts users.
2016-04-19 01:24:10 <Krinkle> and should be announced :)
2016-04-19 01:24:31 <Krinkle> The current May 2nd change actually only involves mw-config changes, no restart of the service.
2016-04-19 01:24:33 <mutante> i know, that is what makes people not touch it
2016-04-19 01:24:41 <mutante> ok
2016-04-19 01:25:03 <Krinkle> So we can't ride it silently, we'll need to send out a separate announcement that we'll also restart (or rather, move) the service to a different server.
2016-04-19 01:25:19 <Krinkle> Most of which we can do before May 2nd if it's non-disruptive.
2016-04-19 01:25:46 <Krinkle> presumably configuring MW and changing DNS can all be done without affecting persistent connections.
2016-04-19 01:26:22 <wikibugs> Operations, ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2216584 (Peachey88)
2016-04-19 01:26:58 <mutante> and long before that ...
2016-04-19 01:27:04 <mutante> the services would have to start
2016-04-19 01:31:21 <wikibugs> Operations, Traffic, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2216597 (Dzahn) Any news from legal?
2016-04-19 01:31:42 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2216598 (Dzahn)
2016-04-19 01:32:33 <icinga-wm> ACKNOWLEDGEMENT - gitblit process on furud is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar gitblit.jar daniel_zahn still upcoming
2016-04-19 01:33:50 <mutante> Apr 19 01:31:18 restbase2004 cassandra[11544]: Exception encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap w...t is true
2016-04-19 01:33:53 <mutante> Apr 19 01:31:18 restbase2004 cassandra[11544]: WARN 01:31:18 No local state or state is in silent shutdown, not announcing shutdown
2016-04-19 01:34:55 <mutante> !log restbase2004 - starting crashed cassandra-b service
2016-04-19 01:34:59 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 01:36:31 <grrrit-wm> (CR) Ori.livneh: [C: ] "the secure flag doesn't prevent the cookie value from being read by client-side javascript code, so this is fine." [puppet] - https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576) (owner: BBlack)
2016-04-19 01:39:07 <wikibugs> Operations, RESTBase-Cassandra: service cassandra-b fails on restbase2004 - https://phabricator.wikimedia.org/T132999#2216604 (Dzahn)
2016-04-19 01:40:25 <icinga-wm> ACKNOWLEDGEMENT - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed daniel_zahn has been going on for many hours. notification issue ? - https://phabricator.wikimedia.org/T132999
2016-04-19 01:41:56 <wikibugs> Operations, Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216620 (Dzahn) a:Dzahn>None
2016-04-19 01:44:30 <icinga-wm> PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 01:44:49 <wikibugs> Operations, Need-volunteer: smokeping config puppetization issue? - https://phabricator.wikimedia.org/T131326#2216635 (Dzahn) p:Triage>Low
2016-04-19 01:46:08 <wikibugs> Operations, Monitoring, Graphite: Allow customizing the alert message from graphite - https://phabricator.wikimedia.org/T95801#2216637 (Dzahn)
2016-04-19 01:47:21 <wikibugs> Operations, RESTBase, Services, Traffic: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216638 (BBlack)
2016-04-19 01:47:42 <wikibugs> Operations, RESTBase, Services, Traffic: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216653 (BBlack) p:Triage>Normal
2016-04-19 01:49:01 <grrrit-wm> (CR) BBlack: [C: 2] secure CP cookie [puppet] - https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576) (owner: BBlack)
2016-04-19 02:02:59 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2091529 (jayvdb) Offtopic a little perhaps, but is anything being done wrt trademark violation of enwikipedia.org ? If wmf legal cant / wont take that domain to trib...
2016-04-19 02:05:46 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2216663 (Peachey88) >>! In T128968#2216661, @jayvdb wrote: > Offtopic a little perhaps, but is anything being done wrt trademark violation of enwikipedia.org ? If wm...
2016-04-19 02:06:40 <urandom> !log systemctl mask cassandra-b on restbase2004.codfw.wmnet (it should not be running)
2016-04-19 02:06:45 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 02:10:20 <wikibugs> Operations, RESTBase-Cassandra: service cassandra-b fails on restbase2004 - https://phabricator.wikimedia.org/T132999#2216604 (Eevans) This node should not be running, it is administratively down; I'm not sure what happened that it started to send notifications now.
2016-04-19 02:11:21 <icinga-wm> RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 02:22:00 <logmsgbot> !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 09m 47s)
2016-04-19 02:22:06 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 02:31:46 <logmsgbot> !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Apr 19 02:31:46 UTC 2016 (duration 9m 46s)
2016-04-19 02:31:51 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 02:36:58 <icinga-wm> PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 77 failures
2016-04-19 02:37:57 <icinga-wm> PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 02:42:50 <wikibugs> Operations, Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216681 (Krinkle) Proposed migration plan after discussing with @Dzahn and @ori on IRC: * Set up kraz (Jessie; VM) to be a replacement for argon (Precise; metal). * Update MediaWiki wmf-config to broadcast...
2016-04-19 02:43:49 <wikibugs> Operations, Patch-For-Review: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2216684 (Krinkle)
2016-04-19 02:44:07 <wikibugs> Operations, Patch-For-Review, developer-notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (Krinkle)
2016-04-19 02:44:59 <wikibugs> Operations, Patch-For-Review, developer-notice, notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (Krinkle)
2016-04-19 02:51:42 <grrrit-wm> (PS1) Jcrespo: Repool pc1006 and pc2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/284123
2016-04-19 02:53:07 <grrrit-wm> (CR) Ori.livneh: [C: ] Repool pc1006 and pc2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/284123 (owner: Jcrespo)
2016-04-19 02:54:21 <logmsgbot> !log ori@tin Synchronized php-1.27.0-wmf.21/includes/api/ApiStashEdit.php: Ie9799f5ea: Segment stash edit cache stats by basis for hit/miss (duration: 00m 39s)
2016-04-19 02:54:28 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 03:00:59 <Krenair> Krinkle, does argon really still need to identify the network as irc.pmtpa.wikimedia.org?
2016-04-19 03:02:49 <wikibugs> Operations, Patch-For-Review, developer-notice, notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2216717 (Krenair) >>! In T123729#2216681, @Krinkle wrote: > manually connect to kraz with IRC and verify e.g. `/join #en.wikipedia` and look for events....
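The manual check quoted above (connect to kraz, `/join #en.wikipedia`, and look for events) could be partly automated once connected: RC feed events arrive as ordinary IRC PRIVMSG lines. A hedged sketch of extracting the channel and message from one such raw line — the sample line is made up, and the real feed's message body carries colour codes and wiki markup:

```python
def parse_privmsg(line):
    """Split a raw IRC PRIVMSG line into (channel, message).
    Returns None for anything that is not a PRIVMSG."""
    if " PRIVMSG " not in line:
        return None
    prefix, rest = line.split(" PRIVMSG ", 1)
    # The trailing parameter starts after " :".
    channel, _, message = rest.partition(" :")
    return channel, message

# Hypothetical sample resembling an RC feed event:
sample = ":rc!~rc@example PRIVMSG #en.wikipedia :[[Main Page]] edited"
print(parse_privmsg(sample))
```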
2016-04-19 03:10:47 <grrrit-wm> (CR) Jcrespo: [C: 2] Repool pc1006 and pc2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/284123 (owner: Jcrespo)
2016-04-19 03:12:32 <logmsgbot> !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool pc2006 (duration: 00m 31s)
2016-04-19 03:12:36 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 03:13:17 <logmsgbot> !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool pc1006 (duration: 00m 28s)
2016-04-19 03:13:21 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 03:14:09 <jynus> "The MariaDB server is running with the --read-only option so it cannot execute this statement (10.192.16.170)"
2016-04-19 03:14:57 <jynus> /wiki/Main_Page
2016-04-19 03:59:38 <icinga-wm> PROBLEM - Disk space on restbase1014 is CRITICAL: DISK CRITICAL - free space: /srv 185631 MB (3% inode=99%)
2016-04-19 04:05:23 <grrrit-wm> (PS2) Krinkle: switchover: set mediawiki master datacenter to codfw [puppet] (switchover) - https://gerrit.wikimedia.org/r/282898 (owner: Giuseppe Lavagetto)
2016-04-19 04:14:14 <icinga-wm> PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time
2016-04-19 04:14:23 <icinga-wm> PROBLEM - Apache HTTP on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time
2016-04-19 04:16:23 <icinga-wm> RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 65907 bytes in 0.094 second response time
2016-04-19 04:16:24 <icinga-wm> RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time
2016-04-19 04:35:53 <icinga-wm> PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
2016-04-19 04:37:14 <icinga-wm> PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
2016-04-19 04:37:15 <icinga-wm> PROBLEM - Check size of conntrack table on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:37:34 <icinga-wm> PROBLEM - HHVM processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:37:43 <icinga-wm> PROBLEM - salt-minion processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:38:33 <icinga-wm> PROBLEM - RAID on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:38:34 <icinga-wm> PROBLEM - configured eth on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:38:44 <icinga-wm> PROBLEM - SSH on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
2016-04-19 04:39:35 <icinga-wm> RECOVERY - salt-minion processes on mw1148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
2016-04-19 04:40:43 <icinga-wm> RECOVERY - configured eth on mw1148 is OK: OK - interfaces up
2016-04-19 04:41:44 <icinga-wm> RECOVERY - HHVM processes on mw1148 is OK: PROCS OK: 6 processes with command name hhvm
2016-04-19 04:42:23 <icinga-wm> PROBLEM - DPKG on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:44:54 <icinga-wm> PROBLEM - nutcracker port on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:45:34 <icinga-wm> PROBLEM - dhclient process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:45:54 <icinga-wm> PROBLEM - salt-minion processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:47:03 <icinga-wm> PROBLEM - Disk space on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:47:53 <icinga-wm> PROBLEM - HHVM processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:47:55 <icinga-wm> PROBLEM - nutcracker process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 04:50:54 <icinga-wm> PROBLEM - configured eth on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 05:01:24 <icinga-wm> RECOVERY - Disk space on mw1148 is OK: DISK OK
2016-04-19 05:01:44 <icinga-wm> RECOVERY - SSH on mw1148 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
2016-04-19 05:01:44 <icinga-wm> RECOVERY - nutcracker port on mw1148 is OK: TCP OK - 0.000 second response time on port 11212
2016-04-19 05:01:54 <icinga-wm> RECOVERY - DPKG on mw1148 is OK: All packages OK
2016-04-19 05:02:15 <icinga-wm> RECOVERY - RAID on mw1148 is OK: OK: no RAID installed
2016-04-19 05:02:25 <icinga-wm> RECOVERY - configured eth on mw1148 is OK: OK - interfaces up
2016-04-19 05:02:46 <icinga-wm> RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.052 second response time
2016-04-19 05:03:04 <icinga-wm> RECOVERY - Check size of conntrack table on mw1148 is OK: OK: nf_conntrack is 12 % full
2016-04-19 05:03:16 <icinga-wm> RECOVERY - HHVM processes on mw1148 is OK: PROCS OK: 6 processes with command name hhvm
2016-04-19 05:03:34 <icinga-wm> RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 68330 bytes in 0.498 second response time
2016-04-19 05:04:05 <icinga-wm> RECOVERY - dhclient process on mw1148 is OK: PROCS OK: 0 processes with command name dhclient
2016-04-19 05:04:44 <icinga-wm> RECOVERY - nutcracker process on mw1148 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
2016-04-19 05:05:15 <icinga-wm> RECOVERY - salt-minion processes on mw1148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
2016-04-19 05:14:01 <wikibugs> Operations, Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2216853 (Nemo_bis) OTRS can be used almost like a mailing list, if all members of a queue set up the notifications for it. Then you have archives and triaging. I'm not saying it's a good soluti...
2016-04-19 05:16:22 <wikibugs> Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2216857 (Nemo_bis) By "this" I assume you mean the list above? I'd like more comments on the methods proposed in the description.
2016-04-19 05:22:07 <grrrit-wm> (PS1) Ori.livneh: Increase php memory-limit for ganglia-web from 256 to 768 [puppet] - https://gerrit.wikimedia.org/r/284129
2016-04-19 05:32:48 <wikibugs> Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2216872 (RobH) set to not announce itself on the mailing lists main page, as we wont allow anyone to subscribe to it. we should also set it to not allow anyone to post without moderation, and then we can wh...
2016-04-19 05:33:46 <icinga-wm> RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
2016-04-19 05:37:41 <grrrit-wm> (PS1) Ori.livneh: ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - https://gerrit.wikimedia.org/r/284130
2016-04-19 05:38:42 <grrrit-wm> (PS2) Ori.livneh: Increase php memory-limit for ganglia-web from 256 to 768 [puppet] - https://gerrit.wikimedia.org/r/284129
2016-04-19 05:38:50 <grrrit-wm> (CR) jenkins-bot: [V: -1] ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - https://gerrit.wikimedia.org/r/284130 (owner: Ori.livneh)
2016-04-19 05:39:56 <grrrit-wm> (PS2) Ori.livneh: ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - https://gerrit.wikimedia.org/r/284130
2016-04-19 05:40:20 <grrrit-wm> (CR) Ori.livneh: [C: 2] Increase php memory-limit for ganglia-web from 256 to 768 [puppet] - https://gerrit.wikimedia.org/r/284129 (owner: Ori.livneh)
2016-04-19 05:40:45 <grrrit-wm> (PS3) Ori.livneh: ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - https://gerrit.wikimedia.org/r/284130
2016-04-19 05:41:54 <grrrit-wm> (CR) Matanya: [C: ] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: Eranroz)
2016-04-19 05:42:15 <grrrit-wm> (CR) Ori.livneh: [C: 2] ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - https://gerrit.wikimedia.org/r/284130 (owner: Ori.livneh)
2016-04-19 06:05:38 <grrrit-wm> (PS1) Ori.livneh: Revert "ganglia-web: don't send Content-Disposition header with JSON / CSV data" [puppet] - https://gerrit.wikimedia.org/r/284131
2016-04-19 06:05:57 <grrrit-wm> (CR) Ori.livneh: [C: 2 V: 2] Revert "ganglia-web: don't send Content-Disposition header with JSON / CSV data" [puppet] - https://gerrit.wikimedia.org/r/284131 (owner: Ori.livneh)
2016-04-19 06:29:06 <icinga-wm> PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: puppet fail
2016-04-19 06:30:35 <icinga-wm> PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures
2016-04-19 06:30:44 <icinga-wm> PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:31:14 <icinga-wm> PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:31:24 <icinga-wm> PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:31:35 <icinga-wm> PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:31:44 <icinga-wm> PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 2 failures
2016-04-19 06:31:45 <icinga-wm> PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:32:05 <icinga-wm> PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
2016-04-19 06:32:06 <icinga-wm> PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:32:35 <icinga-wm> PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 2 failures
2016-04-19 06:32:45 <icinga-wm> PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:32:54 <icinga-wm> PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
2016-04-19 06:32:54 <icinga-wm> PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:34:35 <icinga-wm> PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 06:37:39 <wikibugs> Operations, Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2216915 (elukey)
2016-04-19 06:37:40 <wikibugs> Operations, Analytics, Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2216914 (elukey) Resolved>Open
2016-04-19 06:37:53 <wikibugs> Operations, Analytics, Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2195218 (elukey) Closed it too soon, I can see the root@ notifications again :( ``` elukey@cp4003:~$ ls /etc/logrotate.d/varnishkafka* /etc/logrotate....
2016-04-19 06:38:34 <icinga-wm> PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: puppet fail
2016-04-19 06:46:31 <grrrit-wm> (PS1) Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324)
2016-04-19 06:55:25 <icinga-wm> RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
2016-04-19 06:55:55 <icinga-wm> RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
2016-04-19 06:56:16 <icinga-wm> RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
2016-04-19 06:56:45 <icinga-wm> RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
2016-04-19 06:57:16 <icinga-wm> RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
2016-04-19 06:57:25 <icinga-wm> RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 06:57:25 <icinga-wm> RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
2016-04-19 06:57:26 <icinga-wm> RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
2016-04-19 06:57:34 <icinga-wm> RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
2016-04-19 06:58:05 <icinga-wm> RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
2016-04-19 06:58:15 <icinga-wm> RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
2016-04-19 06:58:25 <icinga-wm> RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 06:58:34 <icinga-wm> RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 06:58:39 <grrrit-wm> (PS4) Muehlenhoff: debdeploy: rename init.pp to master.pp to match class name [puppet] - https://gerrit.wikimedia.org/r/284082 (owner: Dzahn)
2016-04-19 06:58:55 <icinga-wm> RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 06:59:15 <icinga-wm> RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 07:03:16 <grrrit-wm> (CR) Muehlenhoff: [C: 2 V: 2] debdeploy: rename init.pp to master.pp to match class name [puppet] - https://gerrit.wikimedia.org/r/284082 (owner: Dzahn)
2016-04-19 07:03:35 <icinga-wm> RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
2016-04-19 07:32:19 <wikibugs> Operations, ops-codfw, Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2216978 (MoritzMuehlenhoff) I'd say let's either reimage it or drop it from site.pp until reimaged.
2016-04-19 07:34:14 <grrrit-wm> (CR) Giuseppe Lavagetto: [C: ] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - https://gerrit.wikimedia.org/r/283979 (owner: Jcrespo)
2016-04-19 07:43:56 <grrrit-wm> (PS2) Muehlenhoff: Setup meitnerium as the jessie-based archiva host [puppet] - https://gerrit.wikimedia.org/r/283956
2016-04-19 08:03:01 <grrrit-wm> (PS2) Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324)
2016-04-19 08:16:30 <wikibugs> Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073537 (valhallasw) > * make a list of tools.wmflabs.org URLs and test them all for unsecure resources with a simple URL fetching script; > * some sm...
2016-04-19 08:24:06 <wikibugs> Operations, ops-eqiad, hardware-requests: connect an external harddisk with >2TB space to stat1001 - https://phabricator.wikimedia.org/T132476#2217091 (elukey) stat1004 is a new server just created with tons of space: ``` elukey@stat1004:~$ df -h Filesystem Size Used Avail Use% Mounted on udev...
2016-04-19 08:31:33 <wikibugs> Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2217117 (Magnus) As a side note, I set up a VM for my PetScan tool: http://petscan.wmflabs.org/ This does not require http, but works for either, as...
2016-04-19 08:36:11 <wikibugs> Operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#2217130 (hashar) p:High>Low
2016-04-19 08:40:37 <grrrit-wm> (PS1) Muehlenhoff: Blacklist usbip kernel modules [puppet] - https://gerrit.wikimedia.org/r/284138
2016-04-19 08:41:27 <grrrit-wm> (PS7) Jcrespo: Install pt-heartbeat-wikimedia on all relevant servers [puppet] - https://gerrit.wikimedia.org/r/283979
2016-04-19 08:43:31 <grrrit-wm> (CR) Jcrespo: [C: 2] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - https://gerrit.wikimedia.org/r/283979 (owner: Jcrespo)
2016-04-19 08:44:48 <wikibugs> Operations, Beta-Cluster-Infrastructure: HHVM core dumps in Beta Cluster - https://phabricator.wikimedia.org/T1259#2217138 (hashar) Open>Resolved I have deleted `/data/project/core` and created `/data/project/cores` which is where /proc/sys/kernel/core_pattern points to. Should be fine now.
2016-04-19 08:46:36 <grrrit-wm> (PS2) Volans: MariaDB: configurations for codfw as primary [puppet] - https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654)
2016-04-19 08:50:03 <grrrit-wm> (CR) Filippo Giunchedi: [C: ] shinken: Allow undefined data in graphite for disk space checks [puppet] - https://gerrit.wikimedia.org/r/283779 (https://phabricator.wikimedia.org/T111540) (owner: Alex Monk)
2016-04-19 08:50:04 <icinga-wm> PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:50:25 <icinga-wm> PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:50:30 <jynus> mmm
2016-04-19 08:50:51 <volans> jynus: I can take a look ^^
2016-04-19 08:51:05 <icinga-wm> PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:51:07 <jynus> no, it is my change
2016-04-19 08:51:26 <volans> yes, Exec[pt-heartbeat-kill]
2016-04-19 08:51:34 <icinga-wm> PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:51:34 <jynus> 'kill -TERM $(cat /var/run/pt-heartbeat.pid)' is not qualified and no path was specified. Please qualify the command or specify a path.
2016-04-19 08:51:36 <icinga-wm> PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:51:45 <icinga-wm> PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:51:45 <icinga-wm> PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:51:55 <icinga-wm> PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:52:15 <icinga-wm> PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:52:15 <icinga-wm> PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:52:34 <icinga-wm> PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:53:25 <icinga-wm> PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:53:25 <icinga-wm> PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:53:47 <icinga-wm> PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:53:55 <icinga-wm> PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:05 <icinga-wm> PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:05 <icinga-wm> PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:07 <icinga-wm> PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:15 <icinga-wm> PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:15 <icinga-wm> PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:15 <icinga-wm> PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:24 <icinga-wm> PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:54:34 <icinga-wm> PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:55:04 <icinga-wm> PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:56:06 <icinga-wm> PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:56:15 <icinga-wm> PROBLEM - puppet last run on es2012 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:56:55 <icinga-wm> PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:56:57 <icinga-wm> PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:56:57 <icinga-wm> PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:57:14 <icinga-wm> PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:57:15 <icinga-wm> PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:57:21 <grrrit-wm> (PS1) Jcrespo: Add full path for kill command [puppet/mariadb] - https://gerrit.wikimedia.org/r/284139
2016-04-19 08:57:35 <icinga-wm> PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:57:45 <icinga-wm> PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 08:57:45 <icinga-wm> PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:57:45 <icinga-wm> PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:58:00 <grrrit-wm> (CR) Volans: [C: ] Add full path for kill command [puppet/mariadb] - https://gerrit.wikimedia.org/r/284139 (owner: Jcrespo)
2016-04-19 08:58:05 <icinga-wm> PROBLEM - puppet last run on es2014 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:58:16 <icinga-wm> PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:58:40 <grrrit-wm> (CR) Jcrespo: [C: 2] Add full path for kill command [puppet/mariadb] - https://gerrit.wikimedia.org/r/284139 (owner: Jcrespo)
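[Editor's note] The puppet failures above come from Puppet's Exec resource, which rejects a `command` whose executable is neither fully qualified nor resolvable via an explicit `path` parameter, exactly as the quoted error says. A minimal, hypothetical sketch of that kind of fix (the resource name and pidfile are taken from the error text; the actual manifest in gerrit change 284139 may differ):

```puppet
# Sketch only, not the actual patch: Puppet's Exec requires either a
# fully-qualified command or a 'path' parameter. Wrapping in /bin/sh -c
# also keeps the $(cat ...) command substitution working.
exec { 'pt-heartbeat-kill':
  command => '/bin/sh -c "kill -TERM $(cat /var/run/pt-heartbeat.pid)"',
  # Alternatively: command => 'kill -TERM ...' with
  # path => ['/bin', '/usr/bin'] to let Puppet resolve the executable.
  onlyif  => '/usr/bin/test -f /var/run/pt-heartbeat.pid',
}
```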
2016-04-19 08:58:46 <icinga-wm> PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:58:55 <icinga-wm> PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:58:59 <grrrit-wm> (PS1) Gehel: WIP - Use unicast instead of multicast for Elasticsearch node communication [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236)
2016-04-19 08:59:24 <icinga-wm> PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:59:25 <icinga-wm> PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:59:35 <icinga-wm> PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: puppet fail
2016-04-19 08:59:54 <icinga-wm> PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:00:15 <icinga-wm> PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:00:34 <icinga-wm> PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:00:44 <icinga-wm> PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:00:57 <grrrit-wm> (PS1) Jcrespo: Correct mariadb error due to missing full patch [puppet] - https://gerrit.wikimedia.org/r/284141
2016-04-19 09:01:04 <icinga-wm> PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:01:05 <icinga-wm> PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:01:08 <grrrit-wm> (PS2) Jcrespo: Correct mariadb error due to missing full patch [puppet] - https://gerrit.wikimedia.org/r/284141
2016-04-19 09:01:14 <icinga-wm> PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:01:25 <icinga-wm> PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:01:36 <icinga-wm> PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:01:36 <icinga-wm> PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:01:50 <grrrit-wm> (PS2) Filippo Giunchedi: graphite: add graphite1003 [puppet] - https://gerrit.wikimedia.org/r/283989
2016-04-19 09:01:58 <grrrit-wm> (CR) Filippo Giunchedi: [C: 2 V: 2] graphite: add graphite1003 [puppet] - https://gerrit.wikimedia.org/r/283989 (owner: Filippo Giunchedi)
2016-04-19 09:02:02 <icinga-wm> PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:02:04 <icinga-wm> PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:02:12 <icinga-wm> PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:02:22 <icinga-wm> PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:02:22 <icinga-wm> PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:02:24 <grrrit-wm> (PS3) Jcrespo: Correct mariadb error due to missing full patch [puppet] - https://gerrit.wikimedia.org/r/284141
2016-04-19 09:03:03 <icinga-wm> PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:03:13 <icinga-wm> PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:03:33 <icinga-wm> PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:03:33 <icinga-wm> PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:03:34 <icinga-wm> PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 09:05:03 <icinga-wm> PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:05:22 <icinga-wm> PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:05:23 <icinga-wm> PROBLEM - puppet last run on es2017 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:05:23 <icinga-wm> PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:05:32 <icinga-wm> PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:06:42 <godog> !log stop compactions on restbase1014
2016-04-19 09:06:43 <icinga-wm> PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:06:46 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 09:06:52 <icinga-wm> RECOVERY - Disk space on restbase1014 is OK: DISK OK
2016-04-19 09:07:12 <icinga-wm> PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:07:12 <icinga-wm> PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:07:13 <icinga-wm> PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:07:32 <icinga-wm> PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:07:32 <icinga-wm> PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:07:33 <icinga-wm> PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:07:33 <icinga-wm> PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:08:03 <icinga-wm> PROBLEM - puppet last run on es2011 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:08:03 <icinga-wm> PROBLEM - puppet last run on es2019 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:08:23 <icinga-wm> PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:08:23 <icinga-wm> PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:08:42 <icinga-wm> PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 09:09:02 <icinga-wm> PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:09:03 <icinga-wm> PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:09:22 <icinga-wm> PROBLEM - puppet last run on db1057 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:09:32 <icinga-wm> PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:09:33 <icinga-wm> PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:10:12 <icinga-wm> PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:10:23 <icinga-wm> PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:10:32 <icinga-wm> PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:10:52 <icinga-wm> PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:10:52 <icinga-wm> PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:10:53 <icinga-wm> PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:11:02 <icinga-wm> PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:11:52 <grrrit-wm> (CR) Muehlenhoff: WIP - Use unicast instead of multicast for Elasticsearch node communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: Gehel)
2016-04-19 09:11:53 <icinga-wm> PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:11:55 <godog> !log stop cassandra and restbase on restbase1006
2016-04-19 09:11:59 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 09:12:02 <icinga-wm> PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:12:12 <icinga-wm> PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:12:44 <icinga-wm> PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:12:52 <icinga-wm> PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:12:52 <icinga-wm> PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:13:13 <icinga-wm> PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:13:22 <icinga-wm> PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:13:42 <icinga-wm> PROBLEM - puppet last run on db2043 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:14:12 <icinga-wm> PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:14:33 <icinga-wm> PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:14:53 <icinga-wm> PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:15:04 <grrrit-wm> (CR) Jcrespo: [C: 2] Correct mariadb error due to missing full patch [puppet] - https://gerrit.wikimedia.org/r/284141 (owner: Jcrespo)
2016-04-19 09:15:23 <icinga-wm> PROBLEM - puppet last run on db2048 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:16:03 <icinga-wm> PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:16:23 <icinga-wm> PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:16:32 <godog> !log shutdown restbase100[56]
2016-04-19 09:16:33 <icinga-wm> PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:16:36 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 09:17:23 <icinga-wm> RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
2016-04-19 09:17:23 <icinga-wm> PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:17:43 <icinga-wm> RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
2016-04-19 09:17:53 <icinga-wm> PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:17:53 <icinga-wm> RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
2016-04-19 09:17:53 <icinga-wm> PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:18:02 <icinga-wm> RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
2016-04-19 09:18:02 <icinga-wm> PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:18:14 <icinga-wm> PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:18:34 <icinga-wm> PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:18:34 <icinga-wm> RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
2016-04-19 09:19:04 <icinga-wm> RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:19:04 <icinga-wm> PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:19:13 <icinga-wm> RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:19:13 <icinga-wm> RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:19:24 <icinga-wm> RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
2016-04-19 09:19:33 <icinga-wm> RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:19:43 <icinga-wm> RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
2016-04-19 09:19:52 <icinga-wm> RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:20:02 <icinga-wm> RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
2016-04-19 09:20:02 <icinga-wm> RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:20:02 <icinga-wm> PROBLEM - puppet last run on db2008 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:20:22 <icinga-wm> PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:20:23 <icinga-wm> PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:20:32 <icinga-wm> PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:20:32 <icinga-wm> RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
2016-04-19 09:20:32 <icinga-wm> RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
2016-04-19 09:20:32 <icinga-wm> PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:20:43 <icinga-wm> PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: puppet fail
2016-04-19 09:20:53 <icinga-wm> RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:21:02 <icinga-wm> RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:21:33 <icinga-wm> RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
2016-04-19 09:21:44 <icinga-wm> RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:21:59 <wikibugs> Operations, ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2217272 (fgiunchedi) thanks @papaul, rows should be B/C/D, one in each, if possible not located in the same rack as existing restbase systems
2016-04-19 09:22:12 <icinga-wm> RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:22:12 <icinga-wm> RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
2016-04-19 09:22:24 <icinga-wm> RECOVERY - puppet last run on db1074 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
2016-04-19 09:22:43 <icinga-wm> RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
2016-04-19 09:22:52 <icinga-wm> RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
2016-04-19 09:23:13 <icinga-wm> RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:23:22 <icinga-wm> RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
2016-04-19 09:23:23 <icinga-wm> RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
2016-04-19 09:23:33 <icinga-wm> RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
2016-04-19 09:23:53 <icinga-wm> RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:23:53 <icinga-wm> RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:24:02 <icinga-wm> RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
2016-04-19 09:24:13 <icinga-wm> RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:24:32 <icinga-wm> RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
2016-04-19 09:24:54 <icinga-wm> RECOVERY - puppet last run on es2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:25:12 <icinga-wm> RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
2016-04-19 09:25:34 <icinga-wm> RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
2016-04-19 09:25:42 <icinga-wm> RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
2016-04-19 09:25:42 <icinga-wm> RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:26:03 <icinga-wm> RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
2016-04-19 09:26:03 <icinga-wm> RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:26:18 <grrrit-wm> (PS2) Gehel: Use unicast instead of multicast for Elasticsearch node communication [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236)
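[Editor's note] The change above replaces Elasticsearch's multicast zen discovery with an explicit list of seed hosts. A hypothetical sketch of what a puppet-managed `elasticsearch.yml` fragment for this might look like (the class, parameter, and host names are illustrative assumptions, not the contents of gerrit change 284140; `file_line` is from puppetlabs-stdlib):

```puppet
# Hypothetical sketch: manage zen discovery settings for Elasticsearch 1.x/2.x.
# Host names and parameter names are assumptions for illustration only.
class profile::elasticsearch_discovery (
  Array[String] $unicast_hosts = ['elastic1001.eqiad.wmnet', 'elastic1002.eqiad.wmnet'],
) {
  # Point discovery at an explicit seed list instead of multicast.
  file_line { 'es-unicast-hosts':
    path => '/etc/elasticsearch/elasticsearch.yml',
    line => "discovery.zen.ping.unicast.hosts: [${unicast_hosts.join(', ')}]",
  }
  # Explicitly disable multicast discovery.
  file_line { 'es-disable-multicast':
    path => '/etc/elasticsearch/elasticsearch.yml',
    line => 'discovery.zen.ping.multicast.enabled: false',
  }
}
```

As noted later in the log, a setting like this only takes effect after the nodes are restarted, which is why the change was deferred to a full cluster restart.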
2016-04-19 09:26:33 <icinga-wm> RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
2016-04-19 09:26:33 <icinga-wm> RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
2016-04-19 09:27:03 <icinga-wm> RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:27:03 <icinga-wm> RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:27:33 <icinga-wm> RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:27:34 <wikibugs> Operations, Discovery, Discovery-Search-Backlog, Discovery-Search-Sprint, and 2 others: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2217283 (Gehel) Seems that 2 cluster restart are required to enable this change. Let's wait until the datacen...
2016-04-19 09:27:42 <icinga-wm> RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:27:43 <icinga-wm> RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
2016-04-19 09:28:02 <icinga-wm> RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
2016-04-19 09:28:02 <icinga-wm> RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:28:02 <icinga-wm> RECOVERY - puppet last run on db2062 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
2016-04-19 09:28:03 <icinga-wm> RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:28:04 <icinga-wm> RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:28:20 <wikibugs> Operations, Discovery, Discovery-Search-Backlog, Discovery-Search-Sprint, and 2 others: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2217303 (Gehel) a:Gehel
2016-04-19 09:28:42 <icinga-wm> RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
2016-04-19 09:28:43 <icinga-wm> RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
2016-04-19 09:28:43 <icinga-wm> RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
2016-04-19 09:29:54 <icinga-wm> RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
2016-04-19 09:30:32 <icinga-wm> RECOVERY - puppet last run on es2015 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
2016-04-19 09:30:42 <icinga-wm> RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:30:43 <icinga-wm> RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:30:53 <icinga-wm> RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:30:53 <icinga-wm> RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:30:53 <icinga-wm> RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
2016-04-19 09:30:53 <icinga-wm> RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
2016-04-19 09:31:33 <icinga-wm> RECOVERY - puppet last run on es2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:32:51 <grrrit-wm> (PS1) Volans: MariaDB: set codfw local masters as masters (s1-s7) [puppet] - https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699)
2016-04-19 09:32:57 <icinga-wm> RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:33:16 <icinga-wm> RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
2016-04-19 09:33:16 <icinga-wm> RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
2016-04-19 09:33:28 <icinga-wm> RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:33:47 <icinga-wm> RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
2016-04-19 09:34:06 <icinga-wm> RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
2016-04-19 09:34:06 <icinga-wm> RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:34:36 <icinga-wm> RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:34:46 <icinga-wm> RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
2016-04-19 09:35:38 <icinga-wm> RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:35:46 <icinga-wm> RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
2016-04-19 09:35:46 <icinga-wm> RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
2016-04-19 09:35:47 <icinga-wm> RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
2016-04-19 09:35:58 <icinga-wm> RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
2016-04-19 09:36:00 <grrrit-wm> (CR) Volans: MariaDB: set codfw local masters as masters (s1-s7) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: Volans)
2016-04-19 09:36:17 <icinga-wm> RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
2016-04-19 09:36:17 <icinga-wm> RECOVERY - puppet last run on es2011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
2016-04-19 09:36:47 <grrrit-wm> (PS1) Filippo Giunchedi: cassandra: remove restbase100[56] [puppet] - https://gerrit.wikimedia.org/r/284145
2016-04-19 09:36:56 <icinga-wm> RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:36:56 <icinga-wm> RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
2016-04-19 09:37:16 <icinga-wm> RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:37:17 <icinga-wm> RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:37:27 <icinga-wm> RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:37:56 <icinga-wm> RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:38:06 <icinga-wm> RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
2016-04-19 09:38:06 <icinga-wm> RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
2016-04-19 09:38:16 <icinga-wm> RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
2016-04-19 09:38:27 <icinga-wm> RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
2016-04-19 09:38:35 <grrrit-wm> (PS2) Volans: MariaDB: set codfw local masters as masters (s1-s7) [puppet] - https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699)
2016-04-19 09:38:38 <icinga-wm> RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
2016-04-19 09:38:47 <icinga-wm> RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:38:48 <icinga-wm> RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
2016-04-19 09:38:52 <grrrit-wm> (PS2) Filippo Giunchedi: cassandra: remove restbase100[56] [puppet] - https://gerrit.wikimedia.org/r/284145 (https://phabricator.wikimedia.org/T125842)
2016-04-19 09:38:57 <icinga-wm> RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:39:08 <icinga-wm> RECOVERY - puppet last run on db2043 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
2016-04-19 09:39:27 <icinga-wm> RECOVERY - puppet last run on db2048 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
2016-04-19 09:39:37 <icinga-wm> RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
2016-04-19 09:39:38 <icinga-wm> RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
2016-04-19 09:39:46 <icinga-wm> RECOVERY - puppet last run on db1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:39:57 <icinga-wm> RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
2016-04-19 09:40:02 <grrrit-wm> (PS1) Filippo Giunchedi: remove restbase100[56] [dns] - https://gerrit.wikimedia.org/r/284146 (https://phabricator.wikimedia.org/T125842)
2016-04-19 09:41:17 <icinga-wm> RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
2016-04-19 09:43:17 <icinga-wm> RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:43:47 <icinga-wm> RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:44:06 <icinga-wm> RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
2016-04-19 09:44:08 <icinga-wm> RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:44:16 <icinga-wm> RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:44:27 <icinga-wm> RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:44:58 <icinga-wm> RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
2016-04-19 09:45:06 <icinga-wm> RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
2016-04-19 09:45:17 <icinga-wm> RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:45:17 <icinga-wm> RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
2016-04-19 09:45:27 <icinga-wm> RECOVERY - puppet last run on db2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:45:27 <icinga-wm> RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:45:36 <icinga-wm> RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:45:37 <icinga-wm> RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:45:57 <icinga-wm> RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:45:57 <icinga-wm> RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
2016-04-19 09:46:37 <icinga-wm> RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:47:16 <icinga-wm> RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
2016-04-19 09:47:56 <icinga-wm> RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 09:48:03 <Amir1> It's exploding
2016-04-19 09:48:07 <Amir1> :D
2016-04-19 09:48:20 <Amir1> (the channel I mean)
2016-04-19 09:48:27 <grrrit-wm> (PS1) Jcrespo: Update dns records to get the current state [dns] - https://gerrit.wikimedia.org/r/284148
2016-04-19 09:52:29 <grrrit-wm> (CR) Volans: "Results from puppet compiler: https://puppet-compiler.wmflabs.org/2501/"; [puppet] - https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: Volans)
2016-04-19 09:55:49 <grrrit-wm> (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: remove restbase100[56] [puppet] - https://gerrit.wikimedia.org/r/284145 (https://phabricator.wikimedia.org/T125842) (owner: Filippo Giunchedi)
2016-04-19 09:59:09 <grrrit-wm> (CR) Filippo Giunchedi: [C: 2 V: 2] remove restbase100[56] [dns] - https://gerrit.wikimedia.org/r/284146 (https://phabricator.wikimedia.org/T125842) (owner: Filippo Giunchedi)
2016-04-19 10:01:03 <wikibugs> Operations, RESTBase, Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2217422 (fgiunchedi) @Cmjohnson I've deprovisioned restbase1005 and restbase1006 and both are shutdown, should be enough disks to get restbase1015 going now, thanks!
2016-04-19 10:03:29 <grrrit-wm> (PS2) Jcrespo: Update dns records to get the current state [dns] - https://gerrit.wikimedia.org/r/284148
2016-04-19 10:05:23 <grrrit-wm> (CR) Jcrespo: [C: 2] Update dns records to get the current state [dns] - https://gerrit.wikimedia.org/r/284148 (owner: Jcrespo)
2016-04-19 10:07:05 <icinga-wm> ACKNOWLEDGEMENT - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi cassandra-b masked
2016-04-19 10:07:54 <jynus> !log updated dns entries about mysql masters
2016-04-19 10:07:58 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 10:10:47 <grrrit-wm> (CR) Volans: "Puppet changes here: https://puppet-compiler.wmflabs.org/2502/"; [puppet] - https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: Volans)
2016-04-19 10:13:44 <icinga-wm> PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
2016-04-19 10:15:44 <icinga-wm> RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5010135 keys - replication_delay is 0
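The Redis recovery line above reports a `replication_delay` figure. A minimal sketch of pulling that number out of such an icinga-wm message (the message format is inferred from this log, not from the check's source):

```python
import re

# Hedged sketch: extract the replication delay from an icinga-wm Redis
# recovery line like the one for rdb2006 above. Returns None for
# CRITICAL lines that carry no delay figure.
DELAY_RE = re.compile(r"replication_delay is (\d+)")

def redis_replication_delay(msg):
    """Return the reported replication delay in seconds, or None."""
    m = DELAY_RE.search(msg)
    return int(m.group(1)) if m else None
```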
2016-04-19 10:17:39 <grrrit-wm> (PS3) Muehlenhoff: Setup meitnerium as the jessie-based archiva host [puppet] - https://gerrit.wikimedia.org/r/283956
2016-04-19 10:20:02 <grrrit-wm> (CR) Muehlenhoff: [C: 2 V: 2] Setup meitnerium as the jessie-based archiva host [puppet] - https://gerrit.wikimedia.org/r/283956 (owner: Muehlenhoff)
2016-04-19 10:27:27 <wikibugs> Operations, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2217463 (elukey) Open>Resolved
2016-04-19 10:27:29 <wikibugs> Operations, hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2217464 (elukey)
2016-04-19 10:34:58 <grrrit-wm> (CR) Gehel: Use unicast instead of multicast for Elasticsearch node communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: Gehel)
2016-04-19 10:43:25 <grrrit-wm> (PS1) Ori.livneh: Configure ganglia-web to cache data in a location it can actually write to [puppet] - https://gerrit.wikimedia.org/r/284155
2016-04-19 10:43:28 <ori> ^ paravoid
2016-04-19 10:43:56 <paravoid> lol
2016-04-19 10:44:58 <grrrit-wm> (CR) Faidon Liambotis: [C: 2] Configure ganglia-web to cache data in a location it can actually write to [puppet] - https://gerrit.wikimedia.org/r/284155 (owner: Ori.livneh)
2016-04-19 10:47:06 <grrrit-wm> (CR) QChris: [C: 2] "Yup, that deb looks good." [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 10:55:26 <grrrit-wm> (PS1) Jcrespo: Set codfw databases in read-write [mediawiki-config] - https://gerrit.wikimedia.org/r/284157
2016-04-19 10:56:39 <grrrit-wm> (CR) Ori.livneh: [C: ] Set codfw databases in read-write [mediawiki-config] - https://gerrit.wikimedia.org/r/284157 (owner: Jcrespo)
2016-04-19 11:20:34 <grrrit-wm> (PS1) Muehlenhoff: rcstream: Update source range [puppet] - https://gerrit.wikimedia.org/r/284161
2016-04-19 11:21:50 <grrrit-wm> (CR) jenkins-bot: [V: -1] rcstream: Update source range [puppet] - https://gerrit.wikimedia.org/r/284161 (owner: Muehlenhoff)
2016-04-19 11:22:28 <wikibugs> Operations, Analytics, Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2217535 (BBlack) I've killed the rest of them, I think. I'll let you confirm->close this time :)
2016-04-19 11:29:06 <grrrit-wm> (PS2) Muehlenhoff: rcstream: Update source range [puppet] - https://gerrit.wikimedia.org/r/284161
2016-04-19 11:33:04 <volans> !log changing binlog_format to STATEMENT for codfw masters for shards s1-s7 T124699
2016-04-19 11:33:05 <stashbot> T124699: Change configuration to make codfw db masters as the masters of all datacenters - https://phabricator.wikimedia.org/T124699
2016-04-19 11:33:08 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
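The binlog_format change volans !logs above amounts to one `SET GLOBAL` per shard master. A minimal sketch of generating that SQL (the shard list s1-s7 comes from the log; the shard-to-host mapping is omitted, and actually executing against each master is out of scope):

```python
# Sketch of the per-shard change !logged above: switch binlog_format to
# STATEMENT on the codfw masters for shards s1-s7. This only builds the
# SQL strings; it does not connect to any database.
SHARDS = [f"s{i}" for i in range(1, 8)]  # s1..s7, per the log entry

def binlog_format_sql(shards, fmt="STATEMENT"):
    """Map each shard to the statement to run on its master."""
    return {shard: f"SET GLOBAL binlog_format = '{fmt}';" for shard in shards}
```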
2016-04-19 11:51:46 <grrrit-wm> (PS3) Gehel: Use unicast instead of multicast for Elasticsearch node communication [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236)
2016-04-19 11:58:31 <grrrit-wm> (CR) Muehlenhoff: Use unicast instead of multicast for Elasticsearch node communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: Gehel)
2016-04-19 12:03:28 <grrrit-wm> (CR) Gehel: Use unicast instead of multicast for Elasticsearch node communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: Gehel)
2016-04-19 12:04:01 <wikibugs> Operations, Mail, OTRS, WMDE-Fundraising-Software: add WMDE mx's to SpamAssassin trusted hosts to fix SPF softfails - https://phabricator.wikimedia.org/T83499#2217676 (JanZerebecki) Open>Resolved a:JanZerebecki As shown above it might now be fixed. If my memory serves me right @silke...
2016-04-19 12:07:00 <grrrit-wm> (PS4) Gehel: Use unicast instead of multicast for Elasticsearch node communication [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236)
2016-04-19 12:09:42 <grrrit-wm> (CR) Muehlenhoff: "@QChris: Which version of bouncycastle is now required?" [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 12:12:31 <grrrit-wm> (CR) QChris: "> @QChris: Which version of bouncycastle is now required?" [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 12:13:50 <grrrit-wm> (CR) Paladox: "@QChris needs v+2 please." [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 12:15:38 <grrrit-wm> (CR) QChris: "> @QChris needs v+2 please." [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 12:16:56 <wikibugs> Operations, hardware-requests: rack and set up graphite1003 - https://phabricator.wikimedia.org/T132717#2217707 (fgiunchedi) a:Cmjohnson>fgiunchedi setting this up with jessie now
2016-04-19 12:18:17 <grrrit-wm> (CR) Paladox: "Ok ok but I thought Jenkins would need to be re-run for c+2 to take effect meaning Jenkins runs in gate and submit." [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 12:25:13 <grrrit-wm> (CR) Addshore: "@paladox not if there is no gate and submit on this repo" [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 12:28:37 <icinga-wm> PROBLEM - MariaDB Slave SQL: s6 on db2046 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1449, Errmsg: Error The user specified as a definer (root@208.80.154.151) does not exist on query. Default database: ruwiki. Query: DELETE /* GeoData\Hooks::doLinksUpdate 127.0.0.1 */ FROM geo_tags WHERE gt_id = 81263668
2016-04-19 12:28:46 <volans> on it^^^
2016-04-19 12:29:17 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.34 seconds
2016-04-19 13:03:48 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
2016-04-19 13:04:08 <icinga-wm> RECOVERY - MariaDB Slave SQL: s6 on db2046 is OK: OK slave_sql_state Slave_SQL_Running: Yes
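The db2046 incident above is MariaDB errno 1449: a replicated statement's definer account does not exist on the replica, which stops the SQL thread. A small sketch of parsing that information out of the icinga alert text (regex written against the exact message shown above, not against any documented format):

```python
import re

# Hedged sketch: pull the errno and the missing definer account out of a
# MariaDB slave-SQL alert like the one db2046 raised above.
ALERT_RE = re.compile(r"Errno: (\d+), Errmsg: .*definer \((\S+?)\) does not exist")

def parse_definer_error(msg):
    """Return {'errno': int, 'definer': str} for a definer error, else None."""
    m = ALERT_RE.search(msg)
    if not m:
        return None
    return {"errno": int(m.group(1)), "definer": m.group(2)}
```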
2016-04-19 13:17:49 <ori> Wikimedia Platform operations, serious stuff | Status: Up; codfw switchover / read-only at 14:00 UTC | Log: https://bit.ly/wikitech | Channel logs: http://ur1.ca/edq22 | Ops Clinic Duty: akosiaris
2016-04-19 13:20:57 <c> ori: how long is read-only? btw
2016-04-19 13:21:42 <_joe_> c: the least amount of time possible
2016-04-19 13:21:55 <_joe_> we hope to do it within 30 minutes
2016-04-19 13:22:15 <grrrit-wm> (PS5) Addshore: WIP DRAFT WMDE_Analytics module [puppet] - https://gerrit.wikimedia.org/r/269467
2016-04-19 13:22:38 <ori> addshore: WIP DRAFT CALM DOWN J/K WMDE_Analytics module
2016-04-19 13:22:45 <addshore> :P
2016-04-19 13:22:49 <_joe_> lol
2016-04-19 13:22:56 <jynus> should we do read-only at 14? or whenever it hits?
2016-04-19 13:23:08 <addshore> heh, accidentally published that instead of actually keeping it a draft ;)
2016-04-19 13:23:13 <hoo> WIP DNM NOT FINISHED YET DRAFT WMDE ANALYTICS DNM!!!
2016-04-19 13:23:29 <_joe_> jynus: we should do read-only when we're ready, it would be better if we're ready by 14:00
2016-04-19 13:24:10 <_joe_> so I guess the first items on the list should be done a few minutes earlier
2016-04-19 13:24:12 <mark> the later of 14:00 and being ready :)
2016-04-19 13:24:14 <mark> yes
2016-04-19 13:24:15 <jynus> so, the actual question was: should we start the not-so-user-impacting changes before that time?
2016-04-19 13:24:17 <_joe_> that's me and ori
2016-04-19 13:24:19 <mark> yes
2016-04-19 13:24:33 <jynus> ah, you answered before I finished my question :-)
2016-04-19 13:25:18 <ori> _joe_: I'll stop the jobrunners in eqiad at 13:55
2016-04-19 13:25:25 <_joe_> cool
2016-04-19 13:25:52 <grrrit-wm> (PS1) Giuseppe Lavagetto: switchover: stop jobrunners in eqiad [puppet] - https://gerrit.wikimedia.org/r/284181
2016-04-19 13:26:28 <grrrit-wm> (PS1) Giuseppe Lavagetto: switchover: block maintenance scripts from running in eqiad [puppet] - https://gerrit.wikimedia.org/r/284182
2016-04-19 13:27:30 <grrrit-wm> (CR) Hoo man: [C: -1] "I still don't think we should have inline queries, but that's a PM decision in the end." [mediawiki-config] - https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: Yurik)
2016-04-19 13:31:44 <subbu> _joe_, if you remove your -2 on https://gerrit.wikimedia.org/r/#/c/282904/ I'll +2 that.
2016-04-19 13:32:06 <_joe_> subbu: we should deploy it when we've switched over mediawiki though
2016-04-19 13:32:12 <subbu> yes.
2016-04-19 13:32:17 <wikibugs> Operations, ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2217917 (Ottomata) a:Ottomata>elukey
2016-04-19 13:32:20 <subbu> i'll wait for a go before deploying.
2016-04-19 13:32:52 <wikibugs> Operations, ops-codfw, DC-Ops, EventBus, and 3 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2217919 (Ottomata) a:Ottomata>elukey
2016-04-19 13:33:32 <wikibugs> Operations, ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2184249 (Joe) @elukey don't install etcd on these machines for now, we need to come up with a good plan for that.
2016-04-19 13:40:31 <paravoid> so, we'll follow https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki-related
2016-04-19 13:40:32 <grrrit-wm> (PS1) Jcrespo: Depool one db server from each shard as a backup [mediawiki-config] - https://gerrit.wikimedia.org/r/284183
2016-04-19 13:40:53 <paravoid> with 0. being what jynus is doing now :P
2016-04-19 13:41:30 <jynus> I need a sanity check here^
2016-04-19 13:41:47 <paravoid> I'll follow the script and announce every step here
2016-04-19 13:41:52 <_joe_> paravoid: ok
2016-04-19 13:42:11 <_joe_> paravoid: some steps can be done in parallel, like 2 and 3
2016-04-19 13:43:02 <paravoid> yup
2016-04-19 13:43:03 <_joe_> and 7,8,9 and 10 as well
2016-04-19 13:43:08 <_joe_> and then 11,12
2016-04-19 13:43:30 <paravoid> (7) is lacking the execution step btw
2016-04-19 13:43:33 <MarkTraceur> shuffles into his seat with a big foam finger
2016-04-19 13:43:39 <_joe_> sorry, 11 and 12
2016-04-19 13:43:48 <ori> we're not going to warm up memcached (step 1b) only to wipe it in step 6, right? I think step 1 should exclude memcached
2016-04-19 13:43:55 <jynus> yes
2016-04-19 13:43:59 <jynus> that is obsolete now
2016-04-19 13:44:02 <_joe_> paravoid: 7 has instructions
2016-04-19 13:44:03 <jynus> with replication
2016-04-19 13:44:03 <ori> editing
2016-04-19 13:44:13 <jynus> I mean, we could do it
2016-04-19 13:44:25 <jynus> but I am not going to do it because it will have 0 impact
2016-04-19 13:44:32 <jynus> and actually, will probably fail
2016-04-19 13:44:34 <_joe_> paravoid: 6 hasn't, but me and ori covered it, I can add the command there
2016-04-19 13:45:06 <paravoid> I meant 8
2016-04-19 13:45:07 <jynus> so monitoring may be out of sync with actual commands
2016-04-19 13:45:15 <paravoid> (someone added 6 in the meantime :)
2016-04-19 13:45:25 <paravoid> I meant the parsoid deploy, to be more clear
2016-04-19 13:45:29 <jynus> so do not freak out if we start to see replication lag issues
2016-04-19 13:45:32 <godog> I'm adding salt commands for 9 (imagescalers)
2016-04-19 13:45:49 <jynus> as nagios may be updated asynchronously
2016-04-19 13:46:02 <_joe_> paravoid: subbu is going to deploy parsoid
2016-04-19 13:46:14 <grrrit-wm> (CR) Jcrespo: [C: 2] Depool one db server from each shard as a backup [mediawiki-config] - https://gerrit.wikimedia.org/r/284183 (owner: Jcrespo)
2016-04-19 13:47:23 <jynus> deploying the change now
2016-04-19 13:47:27 <jynus> wait for the log:
2016-04-19 13:47:29 <wikibugs> Operations, Continuous-Integration-Scaling: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#2217938 (hashar) Open>Resolved a:hashar Got solved/agreed etc and we eventually have Nodepool installed on a machine in the labs support network.
2016-04-19 13:47:37 <logmsgbot> !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool one db per shard as a backup (duration: 00m 27s)
2016-04-19 13:47:41 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 13:48:08 <icinga-wm> PROBLEM - puppet last run on mw2080 is CRITICAL: CRITICAL: puppet fail
2016-04-19 13:49:28 <bblack> ^ catalog fail
2016-04-19 13:49:46 <_joe_> ?
2016-04-19 13:49:53 <_joe_> so puppetmaster issue?
2016-04-19 13:50:02 <bblack> I'm not 100% sure, when that happens. probably
2016-04-19 13:50:16 <bblack> I'm in readonly mode right now though, I don't want to run the agent there and step on anyone
2016-04-19 13:50:43 <_joe_> so, ori, I'm merging the puppet changes in the repo now
2016-04-19 13:50:54 <paravoid> which puppet changes?
2016-04-19 13:51:00 <ori> disabling jobrunners in eqiad
2016-04-19 13:51:14 <_joe_> paravoid: points 2 and 3
2016-04-19 13:51:17 <ori> and block maintenance scripts from running in eqiad
2016-04-19 13:51:24 <ori> sounds good to me
2016-04-19 13:51:45 <grrrit-wm> (PS7) BBlack: acme-setup script + acme::init [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812)
2016-04-19 13:52:01 <grrrit-wm> (CR) Giuseppe Lavagetto: [C: 2] switchover: stop jobrunners in eqiad [puppet] - https://gerrit.wikimedia.org/r/284181 (owner: Giuseppe Lavagetto)
2016-04-19 13:52:20 <grrrit-wm> (PS2) Giuseppe Lavagetto: switchover: block maintenance scripts from running in eqiad [puppet] - https://gerrit.wikimedia.org/r/284182
2016-04-19 13:52:58 <ori> i'll wait until :55 exactly to salt the service stop command
2016-04-19 13:53:04 <paravoid> yes please
2016-04-19 13:53:13 <_joe_> yes that was the idea
2016-04-19 13:53:20 <paravoid> I'll !log to signal the commence of the rollout
2016-04-19 13:53:22 <_joe_> I'll puppet-merge at that moment exactly
2016-04-19 13:53:29 <paravoid> then let's !log each step
2016-04-19 13:53:33 <_joe_> yes
2016-04-19 13:53:55 <grrrit-wm> (CR) Giuseppe Lavagetto: [C: 2] switchover: block maintenance scripts from running in eqiad [puppet] - https://gerrit.wikimedia.org/r/284182 (owner: Giuseppe Lavagetto)
2016-04-19 13:55:22 <_joe_> should we start paravoid ?
2016-04-19 13:55:28 <jynus> Am I the one deploying to tin step 4, when ready?
2016-04-19 13:55:41 <mark> you were gonna start 39s ago :P
2016-04-19 13:55:49 <ori> !log [switchover #1]: disabling eqiad jobrunners via "salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'".
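The salt command ori !logs above uses a compound grain target (cluster AND site) to stop two services on every eqiad jobrunner at once. A string-building sketch reconstructing that invocation, purely illustrative:

```python
# Sketch reconstructing the salt invocation !logged above: a compound
# grain target (cluster + site) running service-stop commands on every
# matching host. Only builds the command line; it runs nothing.
def salt_service_stop(cluster, site, services):
    target = f"G@cluster:{cluster} and G@site:{site}"
    cmd = " ".join(f"service {s} stop;" for s in services)
    return f"salt -C '{target}' cmd.run '{cmd}'"
```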
2016-04-19 13:55:57 <paravoid> yes, go ahead
2016-04-19 13:56:14 <paravoid> I like the [switchover #N] notation too, let's use that consistently
2016-04-19 13:56:25 <mark> yep
2016-04-19 13:56:26 <_joe_> !log [switchover #3] disabling cronjobs on terbium
2016-04-19 13:56:30 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 13:56:37 <subbu> all deployments happening from tin, right?
2016-04-19 13:56:45 <paravoid> subbu: yes
2016-04-19 13:56:47 <subbu> k
2016-04-19 13:57:00 <ori> #1 completed and verified
2016-04-19 13:57:08 <_joe_> 1?
2016-04-19 13:57:13 <_joe_> it was 2 I thought :P
2016-04-19 13:57:16 <mark> #1 is warmup databases :P
2016-04-19 13:57:19 <_joe_> I am waiting on puppet
2016-04-19 13:57:40 <ori> #2, right. sorry.
2016-04-19 13:57:44 <paravoid> that's going to be a recurring theme today
2016-04-19 13:57:49 <paravoid> (waiting on puppet)
2016-04-19 13:58:01 <jynus> we should have given them names
2016-04-19 13:58:54 <paravoid> _joe_: signal when done
2016-04-19 13:59:00 <_joe_> ok, tendril crons still active, but we can go on
2016-04-19 13:59:02 <mark> adds to learning: puppet too slow
2016-04-19 13:59:16 <paravoid> we knew that already :P
2016-04-19 13:59:21 <mark> ssssh
2016-04-19 13:59:24 <jynus> yes, tendril is no blocker/doesn't affect mediawiki
2016-04-19 13:59:28 <grrrit-wm> (PS2) Faidon Liambotis: Put eqiad in read-only mode for datacenter switchover to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/283953 (https://phabricator.wikimedia.org/T124699) (owner: Jcrespo)
2016-04-19 13:59:29 <_joe_> ok
2016-04-19 13:59:38 <paravoid> let's move with #4 then?
2016-04-19 13:59:41 <_joe_> yes
2016-04-19 13:59:44 <mark> yes
2016-04-19 13:59:57 <_joe_> I'll work on tendril in the meanwhile
2016-04-19 14:00:06 <jynus> about to merge #4
2016-04-19 14:00:14 <paravoid> thank you jynus
2016-04-19 14:00:21 <grrrit-wm> (CR) Jcrespo: [C: 2] Put eqiad in read-only mode for datacenter switchover to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/283953 (https://phabricator.wikimedia.org/T124699) (owner: Jcrespo)
2016-04-19 14:01:21 <jynus> !log [switchover #4] Set mediawiki-eqiad in read-only mode for datacenter switchover to codfw
2016-04-19 14:01:26 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:01:51 <Danny_B> woohoo! hour H is here!
2016-04-19 14:02:01 <jynus> wait for deployment confirmation
2016-04-19 14:02:22 <logmsgbot> !log jynus@tin Synchronized wmf-config/db-eqiad.php: Set mediawiki-eqiad in read-only mode for datacenter switchover to codfw (duration: 00m 35s)
2016-04-19 14:02:27 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:02:45 <jynus> ^that is the confirmation, secondary confirmation on edit save would be welcome
2016-04-19 14:03:04 <paravoid> !log sites in planned readonly-mode, cf. http://blog.wikimedia.org/2016/04/18/wikimedia-server-switch/
2016-04-19 14:03:09 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:03:15 <ori> jynus: yes, confirmed
2016-04-19 14:03:19 <paravoid> awesome
2016-04-19 14:03:22 <mark> confirmed
2016-04-19 14:03:24 <volans> if I go to edit I get the read-only box on the right
2016-04-19 14:03:32 <jynus> same here
2016-04-19 14:03:41 <ori> LTR chauvinists
2016-04-19 14:03:50 <paravoid> _joe_: will you do #5 (wipe memcached)?
2016-04-19 14:03:51 <subbu> :)
2016-04-19 14:03:57 <_joe_> paravoid: yes, on it
2016-04-19 14:04:00 <jynus> ok to move ahead
2016-04-19 14:04:01 <paravoid> thank you
2016-04-19 14:04:11 <ori> that's #6, no?
2016-04-19 14:04:21 <bblack> confirmed anonymous edit has readonly block at top
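The manual edit-page checks above can also be done against the MediaWiki API: while a wiki is read-only, `action=query&meta=siteinfo` includes a `readonly` flag (and a `readonlyreason`) in the `general` block. A sketch of interpreting such a response; the sample payload is illustrative, not a captured response:

```python
# Hedged sketch: detect read-only mode from a MediaWiki siteinfo result
# (action=query&meta=siteinfo&format=json). When read-only, "general"
# carries a "readonly" flag plus a "readonlyreason" string.
def is_read_only(siteinfo):
    general = siteinfo.get("query", {}).get("general", {})
    return "readonly" in general

# Illustrative sample, not real API output from this switchover.
sample = {"query": {"general": {"sitename": "Wikipedia",
                                "readonly": "",
                                "readonlyreason": "Datacenter switchover"}}}
```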
2016-04-19 14:04:29 <mark> yes that's #6
2016-04-19 14:04:42 <volans> QPS on the masters dropping too
2016-04-19 14:04:57 <grrrit-wm> (CR) Faidon Liambotis: [C: 2] switchover: set mediawiki master datacenter to codfw [puppet] (switchover) - https://gerrit.wikimedia.org/r/282898 (owner: Giuseppe Lavagetto)
2016-04-19 14:05:19 <_joe_> paravoid: you need to cherry-pick to production
2016-04-19 14:05:22 <paravoid> ori: can you handle https://gerrit.wikimedia.org/r/#/c/282897/ next? (do not deploy just yet, just heads-up)
2016-04-19 14:05:38 <paravoid> _joe_: that's the cherry-picked one, I believe
2016-04-19 14:05:50 <_joe_> (switchover)
2016-04-19 14:05:50 <ori> i'll rebase it but not merge yet
2016-04-19 14:05:54 <grrrit-wm> (PS2) Ori.livneh: Switch wmfMasterDatacenter to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/282897 (owner: Giuseppe Lavagetto)
2016-04-19 14:06:05 <paravoid> I'm deploying that puppet change, ack?
2016-04-19 14:06:11 <paravoid> this puppet change: https://gerrit.wikimedia.org/r/282898
2016-04-19 14:06:12 <_joe_> paravoid: we need to wait for #5
2016-04-19 14:06:25 <bblack> yeah
2016-04-19 14:06:27 <bblack> Set $app_routes['mediawiki'] = 'codfw' in puppet (cherry-pick https://gerrit.wikimedia.org/r/282898)
2016-04-19 14:06:42 <paravoid> er, right
2016-04-19 14:07:16 <bblack> also, that doesn't list a puppet run after merge, but should include one on eqiad+codfw text caches
2016-04-19 14:07:17 <paravoid> jynus: are you doing "5. set eqiad databases (masters) in read-only mode."?
2016-04-19 14:07:32 <jynus> yes, it is not a blocker for the others
2016-04-19 14:07:40 <jynus> but doing it now
2016-04-19 14:07:41 <_joe_> bblack: does it?
2016-04-19 14:07:42 <mark> we just said it IS a blocker
2016-04-19 14:07:52 <akosiaris> I was about to say that
2016-04-19 14:08:03 <bblack> oh sorry, ignore me
2016-04-19 14:08:15 <_joe_> akosiaris: when paravoid merges that change, are you/mobrovac onto services and restbase?
2016-04-19 14:08:33 <akosiaris> _joe_: yeah
2016-04-19 14:08:35 <bblack> still, app_routes 282898 is listed before wmfMasterDatacenter 282897
2016-04-19 14:08:38 <paravoid> jynus: please confirm
2016-04-19 14:08:41 <jynus> wait
2016-04-19 14:08:47 <paravoid> ack
2016-04-19 14:08:55 <volans> jynus: on the s1 master, only heartbeat shows in a tail -f of the binlog
2016-04-19 14:08:56 <jynus> not yet done
2016-04-19 14:09:01 <mobrovac> _joe_: akosiaris: only rb and scb nodes need the puppet run
2016-04-19 14:09:10 <mobrovac> akosiaris: you take scb, i'll take restbase
2016-04-19 14:09:17 <_joe_> !log [switchover #6] disabled puppet on all redis hosts as a safety measure before inverting replication after the puppet change
2016-04-19 14:09:18 <akosiaris> ok
2016-04-19 14:09:22 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
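"Inverting replication" here means promoting the codfw Redis replicas and re-pointing the old eqiad masters at them; puppet is disabled first so it cannot re-apply the old topology mid-switch. A sketch of the command ordering (host names and port are placeholders, not the real topology):

```python
# Hedged sketch of inverting Redis replication: promote the new master
# first (SLAVEOF NO ONE), then re-slave the old master to it. The
# ordering matters; doing it in reverse would briefly create a loop
# or leave both sides read-only. Hosts below are placeholders.
def invert_replication(new_master, old_master, port=6379):
    """Return the redis-cli commands, in the order they must run."""
    return [
        f"redis-cli -h {new_master} SLAVEOF NO ONE",               # promote
        f"redis-cli -h {old_master} SLAVEOF {new_master} {port}",  # demote
    ]
```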
2016-04-19 14:10:04 <jynus> !log [switchover #5] DB Masters on eqiad set as read-only, and confirmed it
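Step #5 as !logged above is `SET GLOBAL read_only = 1` on each eqiad master, followed by reading the flag back to confirm. A minimal sketch; the host names and flag values below are placeholders, not real state:

```python
# Hedged sketch of switchover step #5: set each eqiad master read-only,
# then verify before proceeding. Only the SQL and the verification
# helper are shown; execution against real hosts is omitted.
READ_ONLY_SQL = "SET GLOBAL read_only = 1;"
VERIFY_SQL = "SELECT @@global.read_only;"

def unsafe_masters(flags):
    """Given {host: @@global.read_only}, list masters still accepting writes."""
    return [host for host, ro in sorted(flags.items()) if not ro]
```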
2016-04-19 14:10:08 <paravoid> _joe_: can you confirm you wiped memcached?
2016-04-19 14:10:08 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:10:23 <volans> spike on DB errors, looking
2016-04-19 14:10:31 <_joe_> !log [switchover #6] wiped memcached
2016-04-19 14:10:35 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:10:37 <paravoid> thank you
2016-04-19 14:10:44 <paravoid> waiting for volans now
2016-04-19 14:10:50 <paravoid> next step will be #7
2016-04-19 14:11:02 <jynus> db errors should be normal, parsercache updates and other errors
2016-04-19 14:11:14 <_joe_> paravoid: that change was not cherry-picked
2016-04-19 14:11:33 <grrrit-wm> (PS1) Giuseppe Lavagetto: switchover: set mediawiki master datacenter to codfw [puppet] - https://gerrit.wikimedia.org/r/284189
2016-04-19 14:11:34 <jynus> it is parsercache, volans
2016-04-19 14:11:35 <mark> we're now 10 minutes into our 30 min read-only window
2016-04-19 14:11:36 <mobrovac> subbu: parsoid should be switched too as soon as _joe_ applies $app_route changes in puppet
2016-04-19 14:11:41 <volans> most of them are REPLACE INTO `pc152` (keyname,value,exptime) VALUES jynus
2016-04-19 14:11:44 <volans> yes
2016-04-19 14:11:56 <_joe_> paravoid: when ready merge https://gerrit.wikimedia.org/r/284189
2016-04-19 14:11:59 <paravoid> _joe_: it's on top of current production and applies cleanly
2016-04-19 14:12:00 <jynus> everything seems ok on db side, we can continue
2016-04-19 14:12:01 <_joe_> and tell us
2016-04-19 14:12:11 <grrrit-wm> (CR) Yurik: [C: ] "Hoo, could you elaborate? This is exactly the same approach as used by all the other api calls, such as pageviews api, MW api, etc. How i" [mediawiki-config] - https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: Yurik)
2016-04-19 14:12:16 <subbu> mobrovac, ok. let me know when.
2016-04-19 14:12:20 <paravoid> I'm not sure what the difference that you see is
2016-04-19 14:12:34 <_joe_> branch: production vs branch: switchover
2016-04-19 14:12:36 <ori> i'll do https://gerrit.wikimedia.org/r/#/c/282897/ after paravoid
2016-04-19 14:12:40 <paravoid> oh, ugh
2016-04-19 14:12:53 <_joe_> heh
2016-04-19 14:12:57 <paravoid> volans, jynus: ack to proceed?
2016-04-19 14:13:03 <bblack> I don't see 284189 in our directions at all, is that a replacement for something else?
2016-04-19 14:13:14 <jynus> paravoid, jynus> everything seems ok on db side, we can continue
2016-04-19 14:13:19 <volans> ack
2016-04-19 14:13:21 <_joe_> bblack: it's the cherry-pick of the change in the directions
2016-04-19 14:13:23 <paravoid> yeah, that cherry-picking thing needs to go in our learnings :)
2016-04-19 14:13:29 <paravoid> don't do that branch thing again :P
2016-04-19 14:13:31 <paravoid> ok
2016-04-19 14:13:32 <bblack> ok
2016-04-19 14:13:33 <paravoid> proceeding with #7
2016-04-19 14:13:33 <mobrovac> haha
2016-04-19 14:13:45 <grrrit-wm> (CR) Faidon Liambotis: [C: 2] switchover: set mediawiki master datacenter to codfw [puppet] - https://gerrit.wikimedia.org/r/284189 (owner: Giuseppe Lavagetto)
2016-04-19 14:13:45 <mark> is in it already
2016-04-19 14:14:01 <grrrit-wm> (CR) Faidon Liambotis: [V: 2] switchover: set mediawiki master datacenter to codfw [puppet] - https://gerrit.wikimedia.org/r/284189 (owner: Giuseppe Lavagetto)
2016-04-19 14:14:07 <_joe_> paravoid: tell us when puppet-merged
2016-04-19 14:14:16 <mobrovac> lemme know when 7a 7b && 7c are done
2016-04-19 14:14:29 <_joe_> mobrovac: they can go in parallel
2016-04-19 14:14:30 <paravoid> !log [switchover #7] setting mediawiki master datacenter to codfw in puppet
2016-04-19 14:14:35 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:14:42 <paravoid> it's palladium-merged
2016-04-19 14:14:50 <mobrovac> ok
2016-04-19 14:14:51 <paravoid> ori: go ahead
2016-04-19 14:14:51 <_joe_> !log [switchover #7] running puppet on mc* hosts in codfw
2016-04-19 14:14:55 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:15:04 <paravoid> mobrovac: go ahead
2016-04-19 14:15:05 <bblack> did we merge the mw-config one yet? isn't that before all the puppet runs?
2016-04-19 14:15:07 <mobrovac> kk
2016-04-19 14:15:10 <ori> !log [switchover #7] Switch wmfMasterDatacenter to codfw (https://gerrit.wikimedia.org/r/#/c/282897/)
2016-04-19 14:15:13 <mobrovac> subbu: go, you too
2016-04-19 14:15:14 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:15:20 <grrrit-wm> (CR) Ori.livneh: [C: 2 V: 2] Switch wmfMasterDatacenter to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/282897 (owner: Giuseppe Lavagetto)
2016-04-19 14:15:20 <subbu> ok. starting parsoid sync
2016-04-19 14:15:26 <paravoid> bblack: ori just doing that
2016-04-19 14:15:36 <mobrovac> !log [switchover #7] puppet agent -tv && restbase restart
2016-04-19 14:15:38 <bblack> doesn't interact with the puppet runs that started first?
2016-04-19 14:15:40 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:16:17 <paravoid> mobrovac/akosiaris: are you doing sca/scb?
2016-04-19 14:16:22 <akosiaris> !log [switchover #7] puppet agent -t -v on SCA, SCB cluster
2016-04-19 14:16:27 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:16:28 <paravoid> awesome, thanks
2016-04-19 14:16:30 <subbu> synced code. restarting parsoid
2016-04-19 14:16:33 <mark> we're now halfway through our 30 min read-only window
2016-04-19 14:16:36 <logmsgbot> !log ori@tin Synchronized wmf-config/CommonSettings.php: Idbfb0184d: Switch wmfMasterDatacenter to codfw (duration: 00m 30s)
2016-04-19 14:16:39 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:16:49 <paravoid> let's proceed with #9, yes?
2016-04-19 14:16:58 <paravoid> deploy the varnish change, that is
2016-04-19 14:17:00 <paravoid> any objections?
2016-04-19 14:17:07 <_joe_> paravoid: nope
2016-04-19 14:17:19 <_joe_> as long as dbs are read-only in both dcs
2016-04-19 14:17:25 <bblack> the directions should've documented which steps are serial dependencies and which can be parallel
2016-04-19 14:17:35 <mark> bblack: i already put that in the learning pad
2016-04-19 14:17:35 <jynus> dbs are read-only on both datacenters
2016-04-19 14:17:36 <akosiaris> in the learning
2016-04-19 14:17:37 <_joe_> yes
2016-04-19 14:17:38 <akosiaris> ok
2016-04-19 14:17:42 <bblack> I'm assuming 7c (puppet runs) did *not* depend on 7b (mw-config)
2016-04-19 14:17:52 <subbu> i am getting this error with restart ..
2016-04-19 14:17:52 <subbu> subbu@earth:~$ for wtp in `ssh ssastry@bast1001.wikimedia.org cat /etc/dsh/group/parsoid` ; do echo $wtp ; ssh ssastry@$wtp sudo service parsoid restart ; done
2016-04-19 14:17:52 <subbu> cat: /etc/dsh/group/parsoid: No such file or directory
2016-04-19 14:17:56 <bblack> in spite of the "consequences" language?
2016-04-19 14:17:56 <subbu> what bast node should i use?
2016-04-19 14:18:09 <_joe_> puppet is taking forever to run
2016-04-19 14:18:11 <_joe_> of course
2016-04-19 14:18:14 <paravoid> urgh
2016-04-19 14:18:17 <akosiaris> subbu: i'll do it
2016-04-19 14:18:20 <subbu> ok, thanks.
2016-04-19 14:18:23 <paravoid> I was about to ask you akosiaris :)
2016-04-19 14:18:32 <jynus> (still seeing most of the traffic going to eqiad backends at this moment)
2016-04-19 14:18:34 <paravoid> (this is probably an artifact of the bast1001 reinstall, I'm guessing)
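The loop subbu ran failed because `/etc/dsh/group/parsoid` was missing on the freshly reinstalled bast1001. A guarded version of the same loop might look like this (a sketch; the group file path is as in the transcript, everything else is illustrative):

```shell
# Restart parsoid on every host in a dsh group file, failing fast
# if the group file itself is missing (as after a bastion reinstall).
restart_parsoid_hosts() {
  local group_file=$1 wtp
  if [ ! -r "$group_file" ]; then
    echo "dsh group file missing or unreadable: $group_file" >&2
    return 1
  fi
  while read -r wtp; do
    [ -n "$wtp" ] || continue
    echo "$wtp"
    ssh "$wtp" sudo service parsoid restart
  done < "$group_file"
}
```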
2016-04-19 14:18:44 <bblack> prepping/rebasing for #9 (Deploy Varnish to switch backend to appserver.svc.codfw.wmnet/api.svc.codfw.wmnet)
2016-04-19 14:18:47 <icinga-wm> RECOVERY - puppet last run on mw2080 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
2016-04-19 14:18:49 <paravoid> thank you bblack
2016-04-19 14:18:57 <grrrit-wm> (PS3) BBlack: switchover: switch api/appservers/rendering varnish routing from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/282910
2016-04-19 14:19:00 <paravoid> ori: any objections from your side with moving forward with #9?
2016-04-19 14:19:09 <subbu> !log manually restarted parsoid on wtp1001 and confirmed html identical before/after switchover on enwiki:Hospet
2016-04-19 14:19:11 <ori> no objections, paravoid
2016-04-19 14:19:13 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:19:27 <paravoid> bblack: go ahead with #9
2016-04-19 14:19:38 <akosiaris> !log [switchover #8] restarting parsoid on all wtp nodes
2016-04-19 14:19:42 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:19:45 <_joe_> !log [switchover #7] memcached redises are now masters in codfw, running puppet on eqiad to start replicating
2016-04-19 14:19:49 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
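Step #7 flips the redis topology: codfw instances become masters while eqiad starts replicating from them. Verifying the new roles could be done with `INFO replication` (a standard redis-cli command); the host:port list below is an assumption.

```shell
# Print the replication role (master/slave) of each redis instance.
redis_roles() {
  local hostport host port role
  for hostport in "$@"; do
    host=${hostport%:*}; port=${hostport#*:}
    role=$(redis-cli -h "$host" -p "$port" INFO replication \
             | awk -F: '/^role:/{print $2}' | tr -d '\r')
    echo "$hostport $role"
  done
}
```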
2016-04-19 14:19:55 <grrrit-wm> (CR) BBlack: [C: 2 V: 2] switchover: switch api/appservers/rendering varnish routing from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/282910 (owner: BBlack)
2016-04-19 14:20:12 <akosiaris> poor puppetmasters ... 100% constantly
2016-04-19 14:20:12 <_joe_> puppet runs in codfw are way slower than in eqiad btw
2016-04-19 14:20:12 <paravoid> godog: prepare for #10
2016-04-19 14:20:26 <godog> paravoid: yup
2016-04-19 14:20:41 <grrrit-wm> (PS4) Filippo Giunchedi: swift: switch to codfw imagescalers [puppet] - https://gerrit.wikimedia.org/r/268080 (https://phabricator.wikimedia.org/T91869)
2016-04-19 14:20:45 <paravoid> it can happen in parallel too, but since #9 is one of the most risky parts of the migration, let's wait for that to be done first
2016-04-19 14:20:55 <bblack> !log [switchover #9] varnish - change merged, puppet runs starting
2016-04-19 14:20:57 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:20:59 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:21:02 <godog> yeah no problem in waiting for #10
2016-04-19 14:21:15 <subbu> akosiaris, i am not seeing restart log messages on wtp2001.codfw.wmnet in /var/log/parsoid/parsoid.log
2016-04-19 14:21:21 <subbu> did you restart codfw nodes as well?
2016-04-19 14:21:26 <jynus> ^checking PROBLEM
2016-04-19 14:21:41 <jynus> probably not an issue
2016-04-19 14:21:42 <akosiaris> subbu: yup
2016-04-19 14:21:53 <akosiaris> it just reported having done it successfully
2016-04-19 14:21:59 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2004 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.37:6379 has 1 databases (db0) with 132831 keys
2016-04-19 14:22:07 <paravoid> _joe_: ^^
2016-04-19 14:22:07 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2009 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.39:6379 has 1 databases (db0) with 125065 keys
2016-04-19 14:22:08 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2012 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.42:6379 has 1 databases (db0) with 131907 keys
2016-04-19 14:22:10 <_joe_> this is expected
2016-04-19 14:22:21 <akosiaris> subbu: is it ok now ? can you ack ?
2016-04-19 14:22:30 <_joe_> paravoid: discard redis messages
2016-04-19 14:22:30 <subbu> i see it now.
2016-04-19 14:22:33 <paravoid> ack
2016-04-19 14:22:37 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2014 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.21:6379 has 1 databases (db0) with 154029 keys
2016-04-19 14:22:37 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.37:6379 has 1 databases (db0) with 147886 keys
2016-04-19 14:22:37 <icinga-wm> PROBLEM - Redis status tcp_6380 on mc2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.34:6380 has 1 databases (db0) with 142851 keys
2016-04-19 14:22:38 <akosiaris> ok, thanks
2016-04-19 14:22:39 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2015 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.22:6379 has 1 databases (db0) with 146559 keys
2016-04-19 14:22:39 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2002 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.35:6379 has 1 databases (db0) with 153457 keys
2016-04-19 14:22:40 <subbu> akosiaris, must have been a rolling restart.
2016-04-19 14:22:46 <akosiaris> subbu: yup
2016-04-19 14:22:48 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2006 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.39:6379 has 1 databases (db0) with 148754 keys
2016-04-19 14:22:49 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.34:6379 has 1 databases (db0) with 140053 keys
2016-04-19 14:22:49 <icinga-wm> PROBLEM - Redis status tcp_6380 on mc2016 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.23:6380 has 1 databases (db0) with 152470 keys
2016-04-19 14:22:57 <jynus> expect also replication lag alerts regarding mysql, they are not blockers (they are a consequence of being read-only)
2016-04-19 14:22:57 <subbu> akosiaris, i confirmed on wtp2001.codfw and wtp1003.eqiad
2016-04-19 14:23:02 <mobrovac> why nobody sent an email saying "this is happening now, RO phase" ?
2016-04-19 14:23:08 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2016 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.23:6379 has 1 databases (db0) with 175083 keys
2016-04-19 14:23:08 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2008 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.38:6379 has 1 databases (db0) with 129407 keys
2016-04-19 14:23:09 <akosiaris> subbu: great!
2016-04-19 14:23:27 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2010 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.40:6379 has 1 databases (db0) with 153045 keys
2016-04-19 14:23:27 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2011 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.41:6379 has 1 databases (db0) with 153620 keys
2016-04-19 14:23:28 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.38:6379 has 1 databases (db0) with 154362 keys
2016-04-19 14:23:30 <paravoid> bblack: to be clear -- waiting for you to confirm
2016-04-19 14:23:38 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.36:6379 has 1 databases (db0) with 155775 keys
2016-04-19 14:23:38 <icinga-wm> PROBLEM - Redis status tcp_6379 on mc2013 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.20:6379 has 1 databases (db0) with 152217 keys
2016-04-19 14:23:41 <_joe_> ori: are you doing the rdb hosts?
2016-04-19 14:23:48 <akosiaris> mark: mobrovac's comment in the learning pad please
2016-04-19 14:23:51 <bblack> paravoid: ack, still puppeting
2016-04-19 14:23:54 <mark> application servers in codfw surging in traffic
2016-04-19 14:23:57 <ori> _joe_: they just finished
2016-04-19 14:24:02 <_joe_> cool
2016-04-19 14:24:03 <mark> mobrovac: mail where?
2016-04-19 14:24:03 <akosiaris> :-)
2016-04-19 14:24:14 <_joe_> can someone take a look at the error logs maybe?
2016-04-19 14:24:17 <mobrovac> mark: to wikitech, wikimedia, somewhere
2016-04-19 14:24:28 <mark> it was announced... anyway, we'll discuss later
2016-04-19 14:24:30 <jynus> mark, confirm dbs in codfw are increasing in traffic, all nominal for now
2016-04-19 14:24:37 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1058 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:24:50 <_joe_> traffic coming to the api servers too
2016-04-19 14:25:05 <bblack> !log [switchover #9] varnish - puppet runs complete - done
2016-04-19 14:25:09 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:25:10 <paravoid> awesome
2016-04-19 14:25:15 <_joe_> cool
2016-04-19 14:25:21 <paravoid> the site works for me
2016-04-19 14:25:21 <mobrovac> !log [switchover #7] restbase now uses MW from codfw
2016-04-19 14:25:25 <_joe_> godog: you're up now :P
2016-04-19 14:25:25 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:25:37 <paravoid> everyone confirm everything looks okay and we can move forward?
2016-04-19 14:25:39 <jynus> as a reminder, ignore alerts related to heartbeat/replication for now
2016-04-19 14:25:46 <paravoid> godog: please go ahead
2016-04-19 14:25:49 <_joe_> aye
2016-04-19 14:25:49 <jynus> +1
2016-04-19 14:25:52 <grrrit-wm> (CR) Lydia Pintscher: [C: -1] "We should not do this without a clearer understanding of how it is going to play together with the queries we have planned for on-wiki. I'" [mediawiki-config] - https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: Yurik)
2016-04-19 14:26:00 <grrrit-wm> (CR) Filippo Giunchedi: [C: 2 V: 2] swift: switch to codfw imagescalers [puppet] - https://gerrit.wikimedia.org/r/268080 (https://phabricator.wikimedia.org/T91869) (owner: Filippo Giunchedi)
2016-04-19 14:26:02 <paravoid> jynus: prepare for #11 :)
2016-04-19 14:26:03 <bblack> confirm anonymous cache-miss on enwiki works, and edit -> readonly block still
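The smoke test bblack describes (an anonymous cache-miss fetch succeeds, an edit still hits the read-only block) could be scripted roughly as below. The cache-busting query parameter and the URL are assumptions for illustration.

```shell
# Fetch a URL with a cache-busting query string (forcing a backend
# fetch through varnish) and check for HTTP 200.
smoke_test() {
  local url=$1 code
  code=$(curl -s -o /dev/null -w '%{http_code}' "${url}?smoketest=$RANDOM")
  echo "$url -> $code"
  [ "$code" = "200" ]
}
```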
2016-04-19 14:26:04 <akosiaris> I am seeing a lot of 500s for GETs btw...
2016-04-19 14:26:09 <jynus> preparing for #11
2016-04-19 14:26:20 <paravoid> yeah, 5xxs are spiking
2016-04-19 14:26:27 <volans> jynus: I think you can already kill pt-heartbeat in the eqiad masters
2016-04-19 14:26:29 <_joe_> akosiaris: yes
2016-04-19 14:26:35 <bblack> it's very slow
2016-04-19 14:26:40 <jynus> volans, please help me with that
2016-04-19 14:26:47 <paravoid> jynus: do not proceed with #11 yet, let's investigate the 500s first
2016-04-19 14:26:48 <icinga-wm> PROBLEM - Redis status tcp_6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6380 has 1 databases (db0) with 5026135 keys
2016-04-19 14:26:50 <bblack> well first cache fill anyways, if I pick an unlikely page
2016-04-19 14:26:54 <subbu> as of 2 mins back, I see parsoid requests now coming to codfw .. stopped on eqiad.
2016-04-19 14:27:00 <godog> paravoid: I've already puppet-merged, rollback #10 ?
2016-04-19 14:27:02 <volans> jynus: sure, if I can do that, paravoid ok
2016-04-19 14:27:04 <paravoid> godog: no
2016-04-19 14:27:08 <icinga-wm> PROBLEM - Redis status tcp_6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6381 has 1 databases (db0) with 5026851 keys
2016-04-19 14:27:08 <icinga-wm> PROBLEM - Redis status tcp_6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6480 has 1 databases (db0) with 5020764 keys
2016-04-19 14:27:08 <icinga-wm> PROBLEM - Redis status tcp_6378 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.119:6378 has 1 databases (db0) with 14 keys
2016-04-19 14:27:15 <_joe_> jynus, volans can it be the parsercache?
2016-04-19 14:27:15 <ori> lots of: Caused by: [Exception DBConnectionError] (/srv/mediawiki/php-1.27.0-wmf.21/includes/db/Database.php:743) DB connection error: Can't connect to MySQL server on '10.192.16.172' (4) (10.192.16.172)
2016-04-19 14:27:27 <paravoid> that's es2018
2016-04-19 14:27:29 <icinga-wm> PROBLEM - Redis status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6479 has 1 databases (db0) with 5010242 keys
2016-04-19 14:27:37 <icinga-wm> PROBLEM - Redis status tcp_6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.119:6379 has 1 databases (db0) with 9710825 keys
2016-04-19 14:27:37 <icinga-wm> PROBLEM - Redis status tcp_6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6481 has 1 databases (db0) with 5031574 keys
2016-04-19 14:27:38 <jynus> it could be a grant problem
2016-04-19 14:27:51 <jynus> checking it
2016-04-19 14:27:52 <ori> some also: Caused by: [Exception DBConnectionError] (/srv/mediawiki/php-1.27.0-wmf.21/includes/db/Database.php:743) DB connection error: Too many connections (10.192.0.142)
2016-04-19 14:28:04 <icinga-wm> PROBLEM - Redis status tcp_6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.119:6380 has 1 databases (db0) with 5008314 keys
2016-04-19 14:28:04 <jynus> then, no, it is a saturation problem
2016-04-19 14:28:06 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2014 is OK: OK: REDIS on 10.192.32.21:6379 has 1 databases (db0) with 155388 keys
2016-04-19 14:28:06 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2007 is OK: OK: REDIS on 10.192.16.37:6379 has 1 databases (db0) with 149110 keys
2016-04-19 14:28:06 <icinga-wm> RECOVERY - Redis status tcp_6380 on mc2001 is OK: OK: REDIS on 10.192.0.34:6380 has 1 databases (db0) with 144121 keys
2016-04-19 14:28:08 <bblack> I see 5xx in my varnish-fe graphs spiking around :22->:25, but seems to be coming back to normal now
2016-04-19 14:28:13 <bblack> (for cache_text)
2016-04-19 14:28:14 <icinga-wm> RECOVERY - Redis status tcp_6381 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6381 has 1 databases (db0) with 5026823 keys
2016-04-19 14:28:14 <volans> Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached.
2016-04-19 14:28:14 <icinga-wm> PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
2016-04-19 14:28:15 <icinga-wm> RECOVERY - Redis status tcp_6378 on rdb2001 is OK: OK: REDIS on 10.192.0.119:6378 has 1 databases (db0) with 14 keys
2016-04-19 14:28:15 <icinga-wm> RECOVERY - Redis status tcp_6480 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6480 has 1 databases (db0) with 5020753 keys
2016-04-19 14:28:18 <volans> on es2018
2016-04-19 14:28:24 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2002 is OK: OK: REDIS on 10.192.0.35:6379 has 1 databases (db0) with 154783 keys
2016-04-19 14:28:24 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2015 is OK: OK: REDIS on 10.192.32.22:6379 has 1 databases (db0) with 147761 keys
2016-04-19 14:28:26 <paravoid> wfLogDBError.log is a mess
2016-04-19 14:28:35 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2011 is OK: OK: REDIS on 10.192.16.41:6379 has 1 databases (db0) with 154663 keys
2016-04-19 14:28:35 <icinga-wm> RECOVERY - Redis status tcp_6380 on mc2016 is OK: OK: REDIS on 10.192.32.23:6380 has 1 databases (db0) with 153809 keys
2016-04-19 14:28:35 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2006 is OK: OK: REDIS on 10.192.0.39:6379 has 1 databases (db0) with 149998 keys
2016-04-19 14:28:35 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2001 is OK: OK: REDIS on 10.192.0.34:6379 has 1 databases (db0) with 141181 keys
2016-04-19 14:28:38 <akosiaris> yup
2016-04-19 14:28:41 <jynus> there is high load on es2* servers
2016-04-19 14:28:43 <paravoid> because of read-only mode though
2016-04-19 14:28:44 <icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
2016-04-19 14:28:45 <icinga-wm> RECOVERY - Redis status tcp_6479 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6479 has 1 databases (db0) with 5010234 keys
2016-04-19 14:28:45 <icinga-wm> RECOVERY - Redis status tcp_6481 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6481 has 1 databases (db0) with 5031520 keys
2016-04-19 14:28:45 <icinga-wm> RECOVERY - Redis status tcp_6379 on rdb2001 is OK: OK: REDIS on 10.192.0.119:6379 has 1 databases (db0) with 9710806 keys
2016-04-19 14:28:46 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2008 is OK: OK: REDIS on 10.192.16.38:6379 has 1 databases (db0) with 130286 keys
2016-04-19 14:29:04 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2005 is OK: OK: REDIS on 10.192.0.38:6379 has 1 databases (db0) with 155418 keys
2016-04-19 14:29:05 <icinga-wm> RECOVERY - Redis status tcp_6380 on rdb2001 is OK: OK: REDIS on 10.192.0.119:6380 has 1 databases (db0) with 5008279 keys
2016-04-19 14:29:06 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2016 is OK: OK: REDIS on 10.192.32.23:6379 has 1 databases (db0) with 176670 keys
2016-04-19 14:29:23 <jynus> is it read only or is it max connections?
2016-04-19 14:29:24 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2010 is OK: OK: REDIS on 10.192.16.40:6379 has 1 databases (db0) with 154184 keys
2016-04-19 14:29:35 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2013 is OK: OK: REDIS on 10.192.32.20:6379 has 1 databases (db0) with 153311 keys
2016-04-19 14:29:37 <volans> 1425 open connections on es2018
2016-04-19 14:29:38 <paravoid> some are because of read-only mode and are thus noise
2016-04-19 14:29:42 <_joe_> jynus: the 5xx logs seem to go down
2016-04-19 14:29:49 <_joe_> so maybe it's getting better?
2016-04-19 14:29:51 <jynus> 2000 connections to es2*
2016-04-19 14:29:56 <volans> out of threads
2016-04-19 14:30:06 <volans> reached thread_pool_max_threads
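The es2018 saturation (1425 open connections hitting `thread_pool_max_threads`, a real MariaDB thread-pool variable) can be spotted by comparing live connections against the pool ceiling. A sketch, assuming direct mysql client access to the host:

```shell
# Report current connections vs. the thread pool limit on one DB host.
thread_headroom() {
  local host=$1 conns maxt
  conns=$(mysql -h "$host" -N -e \
    "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | awk '{print $2}')
  maxt=$(mysql -h "$host" -N -e \
    "SHOW GLOBAL VARIABLES LIKE 'thread_pool_max_threads'" | awk '{print $2}')
  echo "$host: $conns connected, pool max $maxt"
}
```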
2016-04-19 14:30:13 <jynus> that was within the reasonable, problems on first spike
2016-04-19 14:30:14 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2004 is OK: OK: REDIS on 10.192.0.37:6379 has 1 databases (db0) with 134145 keys
2016-04-19 14:30:19 <_joe_> yes
2016-04-19 14:30:31 <bblack> so far data from 14:29 onwards looks like 5xx is dropped back off mostly
2016-04-19 14:30:35 <jynus> should I go out of script and enable read-write on parsercaches?
2016-04-19 14:30:36 <paravoid> 500s seem to be down
2016-04-19 14:30:37 <paravoid> yes
2016-04-19 14:30:38 <_joe_> was anyone logged in and is still logged in?
2016-04-19 14:30:49 <ori> _joe_: i am, yes
2016-04-19 14:30:55 <icinga-wm> PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1168 bytes in 0.121 second response time
2016-04-19 14:30:55 <_joe_> oh cool
2016-04-19 14:30:55 <paravoid> _joe_: me too
2016-04-19 14:30:59 <jynus> waiting to see if it is getting better
2016-04-19 14:31:00 <addshore> _joe_: logged in on wiki? yes me
2016-04-19 14:31:05 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2012 is OK: OK: REDIS on 10.192.16.42:6379 has 1 databases (db0) with 133165 keys
2016-04-19 14:31:05 <ori> dispatch lag also expected
2016-04-19 14:31:08 <volans> _joe_: yes
2016-04-19 14:31:09 <_joe_> that is expected (the wikidata lag)
2016-04-19 14:31:13 <paravoid> 500s are getting back to normal
2016-04-19 14:31:14 <icinga-wm> RECOVERY - Redis status tcp_6380 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6380 has 1 databases (db0) with 5026014 keys
2016-04-19 14:31:21 <_joe_> cool so sessions migrated correctly
2016-04-19 14:31:24 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2009 is OK: OK: REDIS on 10.192.16.39:6379 has 1 databases (db0) with 126216 keys
2016-04-19 14:31:26 <mark> we're now at the 30 min mark of the read-only window
2016-04-19 14:31:26 <godog> held #10, ready to resume btw
2016-04-19 14:31:32 <_joe_> paravoid: should we go on?
2016-04-19 14:31:37 <paravoid> volans, jynus: waiting for you to confirm whether the db load is back to normal again
2016-04-19 14:31:40 <paravoid> _joe_: ^
2016-04-19 14:31:42 <volans> loadavg on es2018 is half now, seems recovering
2016-04-19 14:31:44 <icinga-wm> RECOVERY - Redis status tcp_6379 on mc2003 is OK: OK: REDIS on 10.192.0.36:6379 has 1 databases (db0) with 157194 keys
2016-04-19 14:31:49 <jynus> yes
2016-04-19 14:31:52 <jynus> recovered
2016-04-19 14:31:54 <volans> 168 connections now
2016-04-19 14:31:54 <paravoid> ok
2016-04-19 14:31:55 <icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
2016-04-19 14:31:55 <_joe_> mark: yeah we're a bit late, but not dramatically I'd say
2016-04-19 14:31:58 <jynus> I think we should go on
2016-04-19 14:32:09 <paravoid> please proceed then
2016-04-19 14:32:09 <jynus> applying #11, wait for log
2016-04-19 14:32:14 <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
2016-04-19 14:32:18 <godog> ok I'll finish up #10
2016-04-19 14:32:22 <_joe_> godog: go on yes :)
2016-04-19 14:32:22 <volans> jynus: do you want me to kill pt-heartbeat?
2016-04-19 14:32:30 <_joe_> volans: he said so, yes
2016-04-19 14:32:33 <jynus> volans, please, go on
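pt-heartbeat is Percona Toolkit's replication-delay heartbeat daemon; once codfw masters own the topology, the instances still writing heartbeats on the old eqiad masters must be stopped. What volans does might look like this sketch (host names and the `pkill` approach are assumptions; the daemon may have been managed by an init service instead):

```shell
# Stop the pt-heartbeat writer process on each old-master host.
stop_heartbeat() {
  local host
  for host in "$@"; do
    ssh "$host" 'pkill -f pt-heartbeat' \
      && echo "pt-heartbeat stopped on $host"
  done
}
```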
2016-04-19 14:32:34 <godog> !log [switchover #10] running puppet on ms-fe and reload swift
2016-04-19 14:32:35 <bblack> at :30, data still shows a small increase in 500 (not 503), but it's fairly small in the overall
2016-04-19 14:32:39 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:32:41 <grrrit-wm> (PS3) Jcrespo: MariaDB: set codfw local masters as masters (s1-s7) [puppet] - https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: Volans)
2016-04-19 14:32:45 <icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
2016-04-19 14:32:56 <grrrit-wm> (CR) Jcrespo: [C: 2 V: 2] MariaDB: set codfw local masters as masters (s1-s7) [puppet] - https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: Volans)
2016-04-19 14:33:00 <bblack> ~ 5/sec 500s, vs usually near-zero
2016-04-19 14:33:24 <_joe_> bblack: well let's see once the migration is done
2016-04-19 14:34:04 <jynus> !log [switchover #11] applying $master change for codfw masters
2016-04-19 14:34:09 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:34:10 <jynus> still only on palladium
2016-04-19 14:34:16 <bblack> it's dropping back to near-zero in :31-:32 data now, will know for sure in a couple more minutes
2016-04-19 14:34:17 <jynus> running puppet now
2016-04-19 14:34:49 <_joe_> jynus: should you also set the codfw master to rw now, right?
2016-04-19 14:34:57 <_joe_> or should we wait for this step to finish?
2016-04-19 14:35:17 <paravoid> _joe_: the eqiad one you mean?
2016-04-19 14:35:25 <paravoid> I hope not :P
2016-04-19 14:35:30 <_joe_> paravoid: nope :P
2016-04-19 14:35:40 <jynus> I need heartbeat working first
2016-04-19 14:35:45 <jynus> then, set read-write
2016-04-19 14:35:56 <SPF|Cloud> fail
2016-04-19 14:36:02 <paravoid> did you mean #12, _joe_?
2016-04-19 14:36:11 <paravoid> if that's the case, then no
2016-04-19 14:36:43 <jynus> puppet is very slow
2016-04-19 14:37:05 <icinga-wm> RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
2016-04-19 14:37:21 <_joe_> yeah I meant #12
2016-04-19 14:37:25 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:37:28 <paravoid> _joe_: you're so impatient :)
2016-04-19 14:37:36 <godog> !log [switchover #10] puppet and swift reload finished
2016-04-19 14:37:40 <_joe_> I wanted to help :P
2016-04-19 14:37:40 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:37:52 <akosiaris> ah good I was wondering about swift.. ok
2016-04-19 14:38:21 <_joe_> gehel: is search all right?
2016-04-19 14:38:24 <icinga-wm> RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
2016-04-19 14:38:28 <paravoid> just tried resizing an image, it worked
2016-04-19 14:38:36 <icinga-wm> RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
2016-04-19 14:38:50 <mark> jynus: status?
2016-04-19 14:39:11 <jynus> puppet being run on the masters
2016-04-19 14:39:30 <_joe_> we need to upgrade those puppetmasters to ruby 2.1
2016-04-19 14:39:36 <volans> heartbeat killed on all eqiad masters
2016-04-19 14:39:41 <jynus> I think we can go on the next step
2016-04-19 14:39:42 <_joe_> that would run way faster
2016-04-19 14:39:45 <icinga-wm> RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
2016-04-19 14:39:48 <mark> what next step?
2016-04-19 14:39:52 <mark> talk explicitly :)
2016-04-19 14:39:54 <jynus> with the only risk of getting replication alerts
2016-04-19 14:40:01 <jynus> let me see
2016-04-19 14:40:03 <akosiaris> #12
2016-04-19 14:40:09 <paravoid> #12 is the next step, read-write
2016-04-19 14:40:14 <paravoid> but let's wait for #11 to be done first
2016-04-19 14:40:15 <jynus> no
2016-04-19 14:40:18 <gehel> search seems alright...
2016-04-19 14:40:21 <jynus> no go for 12 yet
2016-04-19 14:40:39 <jynus> I need to do subtask #11, which is: set db masters to read-write
2016-04-19 14:40:41 <jynus> doing it now
2016-04-19 14:40:45 <icinga-wm> RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
2016-04-19 14:40:55 <mark> so next sub step
2016-04-19 14:40:56 <bblack> right we need 11a+b+c before 12, I believe
2016-04-19 14:41:04 <jynus> mark, yes, sorry
2016-04-19 14:41:10 <gehel> CirrusSearch sees a decrease in response time, just as expected...
2016-04-19 14:41:10 <jynus> no go yet for #12
2016-04-19 14:41:17 <jynus> doing #11-2
2016-04-19 14:41:34 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1023 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:41:43 <jynus> ^ignore
2016-04-19 14:41:48 <mark> what's the ETA for being ready for #12?
2016-04-19 14:41:52 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db1056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.03 seconds
2016-04-19 14:41:53 <jynus> 1 minute
2016-04-19 14:42:05 <icinga-wm> PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0]
2016-04-19 14:42:22 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.22 seconds
2016-04-19 14:42:22 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.62 seconds
2016-04-19 14:42:32 <paravoid> gehel: can you investigate wdqs in the meantime?
2016-04-19 14:42:42 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db1064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.46 seconds
2016-04-19 14:42:49 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db1059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.94 seconds
2016-04-19 14:42:49 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 369.05 seconds
2016-04-19 14:43:00 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 360.21 seconds
2016-04-19 14:43:09 <jynus> !log [switchover #11-2] Set and confirmed codfw master dbs in read-write
2016-04-19 14:43:09 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db1026 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.94 seconds
2016-04-19 14:43:09 <gehel> paravoid: sure
2016-04-19 14:43:09 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 366.10 seconds
2016-04-19 14:43:09 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 402.12 seconds
2016-04-19 14:43:09 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 402.15 seconds
2016-04-19 14:43:13 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
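Sub-step #11-2 flips the codfw masters from read-only to read-write and then confirms the change on each host. A hedged sketch of that toggle-and-verify flow (hostnames and helper names are invented; the real procedure ran the statements against each master directly):

```python
# Hypothetical sketch of switchover sub-step #11-2: build the statements to
# flip each codfw master to read-write, then confirm the end state.
CODFW_MASTERS = ["db2016", "db2017", "db2018"]  # invented shard masters

def rw_statements(hosts):
    """Per-master SQL: disable read_only, then read the value back."""
    return {h: ["SET GLOBAL read_only = 0;",
                "SELECT @@global.read_only;"] for h in hosts}

def confirm_read_write(results):
    """results maps host -> value of @@global.read_only after the change.
    The sub-step is only 'done' once every master reports 0."""
    failed = sorted(h for h, v in results.items() if v != 0)
    return (not failed, failed)
```

The "set and confirmed" wording in the !log entry matters: declaring read-write before every master actually reports `read_only = 0` would risk split-brain writes.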
2016-04-19 14:43:24 <jynus> the errors come from the not finished puppet
2016-04-19 14:43:27 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 418.29 seconds
2016-04-19 14:43:27 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db1062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 365.43 seconds
2016-04-19 14:43:30 <paravoid> _joe_: as for puppet, 2.1 is not going to be miraculous; we just need to rely on it far less for runtime configurations
2016-04-19 14:43:34 <jynus> they should not affect users
2016-04-19 14:43:36 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.97 seconds
2016-04-19 14:43:37 <paravoid> jynus: ack
2016-04-19 14:43:41 <_joe_> paravoid: and that too, yes
2016-04-19 14:43:42 <jynus> we can go with #12
2016-04-19 14:43:46 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 437.65 seconds
2016-04-19 14:43:46 <paravoid> jynus: can you confirm which #11 substeps are done now?
2016-04-19 14:43:55 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 410.75 seconds
2016-04-19 14:43:58 <jynus> 11-1 (in progress)
2016-04-19 14:44:04 <jynus> 11-2 (done)
2016-04-19 14:44:07 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 440.53 seconds
2016-04-19 14:44:14 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 458.24 seconds
2016-04-19 14:44:14 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 405.28 seconds
2016-04-19 14:44:15 <jynus> 11-3 (aborted)
2016-04-19 14:44:21 <icinga-wm> PROBLEM - MariaDB Slave Lag: es2 on es2014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.18 seconds
2016-04-19 14:44:23 <paravoid> aborted?
2016-04-19 14:44:25 <akosiaris> ?
2016-04-19 14:44:25 <mark> aborted?
2016-04-19 14:44:33 <jynus> not a blocker
2016-04-19 14:44:34 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db1034 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 432.61 seconds
2016-04-19 14:44:34 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1052 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:44:34 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db1071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 469.40 seconds
2016-04-19 14:44:36 <volans> jynus: 11-3 looks done for me, I see read_only = OFF on codfw
2016-04-19 14:44:41 <volans> masters
2016-04-19 14:44:42 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db1049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 469.55 seconds
2016-04-19 14:44:42 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 470.33 seconds
2016-04-19 14:44:44 <bblack> jynus: 11-3 is "Set codfw masters mysql as read-write"
2016-04-19 14:44:48 <mark> how is 11-3 aborted not a blocker?
2016-04-19 14:44:49 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db1061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 451.00 seconds
2016-04-19 14:44:49 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 473.13 seconds
2016-04-19 14:44:50 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.40 seconds
2016-04-19 14:44:53 <jynus> sorry
2016-04-19 14:44:57 <jynus> I meant 11-4
2016-04-19 14:44:57 <akosiaris> maybe he means 11-4
2016-04-19 14:44:58 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db1037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 457.60 seconds
2016-04-19 14:45:00 <akosiaris> ah ok
2016-04-19 14:45:01 <paravoid> ok
2016-04-19 14:45:03 <jynus> sorry about the confusion
2016-04-19 14:45:04 <mark> ok
2016-04-19 14:45:06 <paravoid> everyone ok to proceed with #12 then?
2016-04-19 14:45:10 <mark> yes
2016-04-19 14:45:11 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db1042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 525.26 seconds
2016-04-19 14:45:12 <icinga-wm> PROBLEM - MariaDB Slave Lag: es3 on es1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 433.70 seconds
2016-04-19 14:45:12 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db1068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 525.99 seconds
2016-04-19 14:45:12 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 510.02 seconds
2016-04-19 14:45:12 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 510.43 seconds
2016-04-19 14:45:17 <bblack> yes
2016-04-19 14:45:19 <ori> yes
2016-04-19 14:45:20 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 474.24 seconds
2016-04-19 14:45:20 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 491.41 seconds
2016-04-19 14:45:20 <volans> +1
2016-04-19 14:45:21 <jynus> please someone else apply #12
2016-04-19 14:45:24 <paravoid> ori: can you?
2016-04-19 14:45:27 <icinga-wm> PROBLEM - MariaDB Slave Lag: es2 on es2016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 453.58 seconds
2016-04-19 14:45:29 <jynus> while I fix 11-1
2016-04-19 14:45:30 <ori> ok
2016-04-19 14:45:33 <paravoid> thanks
2016-04-19 14:45:38 <jynus> (the cause of the alerts)
2016-04-19 14:45:45 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db1030 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 514.52 seconds
2016-04-19 14:45:45 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 551.45 seconds
2016-04-19 14:45:50 <grrrit-wm> (PS2) Ori.livneh: Set codfw databases in read-write [mediawiki-config] - https://gerrit.wikimedia.org/r/284157 (owner: Jcrespo)
2016-04-19 14:46:00 <grrrit-wm> (CR) Ori.livneh: [C: 2] Set codfw databases in read-write [mediawiki-config] - https://gerrit.wikimedia.org/r/284157 (owner: Jcrespo)
2016-04-19 14:46:05 <paravoid> _joe_: can you prepare for #13/#14?
2016-04-19 14:46:06 <icinga-wm> PROBLEM - MariaDB Slave Lag: es3 on es1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 479.09 seconds
2016-04-19 14:46:13 <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 554.66 seconds
2016-04-19 14:46:18 <grrrit-wm> (CR) Ori.livneh: [V: 2] Set codfw databases in read-write [mediawiki-config] - https://gerrit.wikimedia.org/r/284157 (owner: Jcrespo)
2016-04-19 14:46:20 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db1050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 536.55 seconds
2016-04-19 14:46:20 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 537.65 seconds
2016-04-19 14:46:20 <_joe_> paravoid: yes, but we have an issue there
2016-04-19 14:46:29 <_joe_> O
2016-04-19 14:46:29 <icinga-wm> PROBLEM - MariaDB Slave Lag: es3 on es1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 484.57 seconds
2016-04-19 14:46:39 <paravoid> do tell
2016-04-19 14:46:40 <icinga-wm> PROBLEM - MariaDB Slave Lag: es3 on es2017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 518.46 seconds
2016-04-19 14:46:41 <icinga-wm> ACKNOWLEDGEMENT - High lag on wdqs1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [1800.0] Gehel investigating
2016-04-19 14:46:41 <icinga-wm> ACKNOWLEDGEMENT - High lag on wdqs1002 is CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] Gehel investigating
2016-04-19 14:46:48 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db1028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 562.45 seconds
2016-04-19 14:46:55 <ori> jynus: ok to sync?
2016-04-19 14:46:56 <icinga-wm> PROBLEM - MariaDB Slave Lag: es2 on es1011 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 542.55 seconds
2016-04-19 14:47:02 <jynus> so here is the thing: replication is running; it is the alerts that have an issue with the change
2016-04-19 14:47:05 <volans> wfLogDBError back to low values
2016-04-19 14:47:06 <icinga-wm> PROBLEM - MariaDB Slave Lag: es2 on es2015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 561.83 seconds
2016-04-19 14:47:11 <jynus> ori, please deploy
2016-04-19 14:47:13 <icinga-wm> PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 523.56 seconds
2016-04-19 14:47:20 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db1039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 585.15 seconds
2016-04-19 14:47:28 <icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.52 seconds
2016-04-19 14:47:37 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db2029 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 591.81 seconds
2016-04-19 14:47:37 <icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 592.16 seconds
2016-04-19 14:47:59 <icinga-wm> PROBLEM - MariaDB Slave Lag: es3 on es2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.25 seconds
2016-04-19 14:47:59 <jynus> all replicas are healthy and at 0 lag right now
2016-04-19 14:48:05 <jynus> I double checked
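The flood of slave-lag PROBLEM/RECOVERY messages follows simple threshold logic: a lag reading above the critical threshold pages, a successful reading near zero recovers, and an unreadable heartbeat (the "NRPE: Unable to read output" lines) is a distinct state. A sketch of how such a check might classify a reading (thresholds are assumptions, not the production values):

```python
# Hypothetical classification logic behind the "MariaDB Slave Lag" alerts.
# WARN_S and CRIT_S are invented; the real values live in the NRPE config.
WARN_S, CRIT_S = 60.0, 300.0

def classify_lag(lag_seconds):
    if lag_seconds is None:
        return "UNKNOWN"      # e.g. "NRPE: Unable to read output"
    if lag_seconds >= CRIT_S:
        return "CRITICAL"
    if lag_seconds >= WARN_S:
        return "WARNING"
    return "OK"
```

Under these assumed thresholds, the 322-second reading on db1056 classifies as CRITICAL and the 0.19-second recovery on db1062 as OK, matching the alert stream above.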
2016-04-19 14:48:08 <icinga-wm> PROBLEM - MariaDB Slave Lag: es3 on es2018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.53 seconds
2016-04-19 14:48:11 <paravoid> ok
2016-04-19 14:48:14 <paravoid> I put it down in learnings
2016-04-19 14:48:17 <logmsgbot> !log ori@tin Synchronized wmf-config/db-codfw.php: [switchover #12] I5e9635b8f4: Set codfw databases in read-write (duration: 00m 35s)
2016-04-19 14:48:21 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:48:25 <icinga-wm> PROBLEM - MariaDB Slave Lag: es2 on es1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.97 seconds
2016-04-19 14:48:28 <akosiaris> so, all those pages are false positives ?
2016-04-19 14:48:32 <icinga-wm> PROBLEM - MariaDB Slave Lag: es2 on es1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.24 seconds
2016-04-19 14:48:32 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db1062 is OK: OK slave_sql_lag Replication lag: 0.19 seconds
2016-04-19 14:48:33 <bblack> confirmed anon edit -> no readonly box
2016-04-19 14:48:40 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
2016-04-19 14:48:43 <volans> ori: confirmed no more alert on edit
2016-04-19 14:48:48 <paravoid> !log sites are read-write again
2016-04-19 14:48:50 <icinga-wm> PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.99 seconds
2016-04-19 14:48:52 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 14:48:54 <jynus> actually, no, db1029 replica is broken (not a blocker)
2016-04-19 14:49:00 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.34 seconds
2016-04-19 14:49:00 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db1034 is OK: OK slave_sql_lag Replication lag: 0.19 seconds
2016-04-19 14:49:01 <icinga-wm> PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.42 seconds
2016-04-19 14:49:15 <ori> perf save times are coming in, so users are saving
2016-04-19 14:49:18 <godog> seeing rc changes on irc again for enwiki
2016-04-19 14:49:20 <mark> so ~45 mins readonly
2016-04-19 14:49:28 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.35 seconds
2016-04-19 14:49:33 <addshore> I can indeed save!
2016-04-19 14:49:34 <paravoid> 48 :)
2016-04-19 14:49:43 <paravoid> _joe_: what is the issue you mentioned?
2016-04-19 14:49:49 <bblack> 5xx is normal-ish so far
2016-04-19 14:49:50 <mark> 47
2016-04-19 14:49:50 <jynus> so, ongoing issues I have: x1- broken replica to eqiad
2016-04-19 14:49:51 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1029 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:49:51 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:49:58 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.45 seconds
2016-04-19 14:49:58 <icinga-wm> PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 805.70 seconds
2016-04-19 14:50:08 <_joe_> paravoid: a hand-written config on rdb2003
2016-04-19 14:50:09 <jynus> some potential improvements on weights
2016-04-19 14:50:13 <_joe_> which broke replica
2016-04-19 14:50:19 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db1028 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
2016-04-19 14:50:24 <akosiaris> argh
2016-04-19 14:50:38 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db1039 is OK: OK slave_sql_lag Replication lag: 0.23 seconds
2016-04-19 14:50:43 <subbu> verified a VE edit.
2016-04-19 14:50:45 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db2023 is OK: OK slave_sql_lag Replication lag: 0.48 seconds
2016-04-19 14:50:45 <grrrit-wm> (CR) Chad: "@qchris: Yeah I knew it was gonna be a lengthy one. Probably should do it late on a Friday." [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
2016-04-19 14:50:55 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
2016-04-19 14:51:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db2029 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
2016-04-19 14:51:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
2016-04-19 14:51:05 <jynus> puppet execution time was underestimated
2016-04-19 14:51:16 <mark> I didn't underestimate it :P
2016-04-19 14:51:26 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db1026 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
2016-04-19 14:51:36 <paravoid> _joe_: are you dealing with it? do you need any help?
2016-04-19 14:51:45 <_joe_> paravoid: me and ori are
2016-04-19 14:51:50 <jynus> mark, with orchestration out of puppet I would not have had this issue
2016-04-19 14:51:56 <volans> jynus: I'm taking a closer look at all DBs with loadavg > 10, just a few
2016-04-19 14:52:02 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db1045 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
2016-04-19 14:52:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db1071 is OK: OK slave_sql_lag Replication lag: 0.06 seconds
2016-04-19 14:52:08 <jynus> yes, there are some overloaded
2016-04-19 14:52:10 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db1049 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
2016-04-19 14:52:10 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db2045 is OK: OK slave_sql_lag Replication lag: 0.35 seconds
2016-04-19 14:52:20 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
2016-04-19 14:52:29 <jynus> sorry for the spam
2016-04-19 14:52:29 <_joe_> so, going on
2016-04-19 14:52:30 <icinga-wm> PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: NRPE: Unable to read output
2016-04-19 14:52:31 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
2016-04-19 14:52:31 <icinga-wm> RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
2016-04-19 14:52:33 <jynus> but it should not have paged
2016-04-19 14:52:33 <paravoid> _joe_: please use this channel to coordinate :)
2016-04-19 14:52:46 <_joe_> paravoid: yeah we were just checking errors in query
2016-04-19 14:52:53 <_joe_> I am ready to go on
2016-04-19 14:53:04 <paravoid> is that issue dealt with?
2016-04-19 14:53:18 <grrrit-wm> (PS1) Giuseppe Lavagetto: switchover: enable maintenance scripts in codfw [puppet] - https://gerrit.wikimedia.org/r/284195
2016-04-19 14:53:19 <_joe_> yes
2016-04-19 14:53:20 <icinga-wm> PROBLEM - MariaDB Slave Lag: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 899.61 seconds
2016-04-19 14:53:22 <paravoid> ack
2016-04-19 14:53:26 <ori> should i start the job queue in codfw?
2016-04-19 14:53:28 <paravoid> proceed with #13/#14
2016-04-19 14:53:28 <jynus> are maintenance scripts running?
2016-04-19 14:53:34 <_joe_> jynus: in a minute
2016-04-19 14:53:41 <gehel> seems we have an issue with wdqs-updater not being able to update latest edits from wikidata. Not blocking...
2016-04-19 14:53:48 <ori> _joe_: i'll do 13, you do 14, yes?
2016-04-19 14:53:48 <jynus> _joe_, can you hold them for a second?
2016-04-19 14:53:54 <_joe_> yes
2016-04-19 14:54:01 <paravoid> let's hold this off per jynus
2016-04-19 14:54:03 <grrrit-wm> (CR) Giuseppe Lavagetto: [C: 2 V: 2] switchover: enable maintenance scripts in codfw [puppet] - https://gerrit.wikimedia.org/r/284195 (owner: Giuseppe Lavagetto)
2016-04-19 14:54:07 <jynus> volans and I need to check weights
2016-04-19 14:54:08 <_joe_> oh
2016-04-19 14:54:12 <icinga-wm> PROBLEM - MySQL Slave Running on db1029 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error executing row event: Cannot execute statement: impossible to w
2016-04-19 14:54:14 <mark> 16:53:17 <Trizek> Special:RecentChanges is not refreshing (feedback from en.wp and es.wp
2016-04-19 14:54:14 <mark> 16:53:18 <Trizek> )
2016-04-19 14:54:15 <grrrit-wm> (PS1) Ori.livneh: switchover: make jobrunners in codfw start up [puppet] - https://gerrit.wikimedia.org/r/284196
2016-04-19 14:54:17 <bblack> just don't puppet-merge
2016-04-19 14:54:18 <jynus> I may need some adjustments
2016-04-19 14:54:22 <paravoid> ori: ^^
2016-04-19 14:54:22 <jynus> *it
2016-04-19 14:54:27 <ori> holding off, ack
2016-04-19 14:54:39 <_joe_> mark: that's because scripts are not running I think
2016-04-19 14:55:07 <addshore> _joe_: indeed
2016-04-19 14:55:10 <icinga-wm> PROBLEM - Redis status tcp_6380 on rdb1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.16.183 on port 6380
2016-04-19 14:55:18 <_joe_> uhm
2016-04-19 14:55:19 <paravoid> huh?
2016-04-19 14:55:20 <_joe_> checking
2016-04-19 14:55:35 <Krenair> #en.wikipedia stream is only showing abusefilter logs, not edits
2016-04-19 14:55:50 <addshore> Krenair: that should be the same issue as Special:RC not populating
2016-04-19 14:55:54 <MatmaRex> more feedback: file deletion is apparently throwing exceptions? (i recall some bug about this filed yesterday, so might not be new)
2016-04-19 14:55:57 <Krenair> probably
2016-04-19 14:56:17 <_joe_> Krenair: that could be the jobqueue
2016-04-19 14:56:26 <_joe_> Krenair: rcstream works correctly?
2016-04-19 14:56:29 <_joe_> can you test it
2016-04-19 14:56:34 <_joe_> ?
2016-04-19 14:56:46 <mark> godog: can you look at file deletions?
2016-04-19 14:56:58 <ori> probably also the jobqueue
2016-04-19 14:57:01 <addshore> uploads get an exception too
2016-04-19 14:57:01 <addshore> [VxZHKQrAIE0AAIq6MWkAAABJ] /wiki/Special:Upload JobQueueError from line 200 of /srv/mediawiki/php-1.27.0-wmf.21/includes/jobqueue/JobQueueFederated.php: Could not insert job(s), 5 partitions tried.
2016-04-19 14:57:10 <godog> mark: yup, checking
2016-04-19 14:57:11 <icinga-wm> RECOVERY - Redis status tcp_6380 on rdb1004 is OK: OK: REDIS on 10.64.16.183:6380 has 1 databases (db0) with 9703267 keys - replication_delay is 8
2016-04-19 14:57:13 <Krenair> _joe_, no, I see only logs
2016-04-19 14:57:16 <mark> might be jobqueue indeed
2016-04-19 14:57:23 <_joe_> ori: ^^
2016-04-19 14:57:28 <_joe_> could not insert jobs
2016-04-19 14:57:35 <ori> the aggregators aren't running
2016-04-19 14:57:41 <ori> we should start those
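The `JobQueueError ... Could not insert job(s), 5 partitions tried` paste above comes from MediaWiki's federated job queue, which spreads jobs across several Redis partitions and only fails an insert once every partition has rejected it. A loose sketch of that fallback loop (partition names and the `try_push` callback are invented; this is not the actual JobQueueFederated code):

```python
# Loose sketch of federated job insertion with partition fallback, modeled
# on the error message above. When every partition's Redis is unreachable
# (as with the broken codfw hostnames), all pushes fail and the whole
# insert errors out.
def push_job(job, partitions, try_push):
    """Try each partition in turn; raise only if all of them reject the job."""
    tried = 0
    for name in partitions:
        tried += 1
        if try_push(name, job):
            return name
    raise RuntimeError("Could not insert job(s), %d partitions tried." % tried)
```

This is why uploads and deletions broke while plain edits worked: edits do not strictly need a job insert at save time, but those operations do.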
2016-04-19 14:57:43 <_joe_> ok
2016-04-19 14:57:49 <paravoid> jynus: status?
2016-04-19 14:57:52 <bblack> #13/#14 for maint/jobqueue still on hold pending jynus
2016-04-19 14:57:54 <_joe_> what is the aggregator? jobchron?
2016-04-19 14:58:05 <grrrit-wm> (PS1) Jcrespo: Tweak db weights after switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/284197
2016-04-19 14:58:13 <jynus> I need to apply this^
2016-04-19 14:58:20 <ori> do it
2016-04-19 14:58:30 <paravoid> ack
2016-04-19 14:58:48 <grrrit-wm> (CR) Jcrespo: [C: 2] Tweak db weights after switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/284197 (owner: Jcrespo)
2016-04-19 14:58:54 <jynus> there are some databases flapping
2016-04-19 14:59:01 <jynus> due to excessive traffic
2016-04-19 14:59:15 <grrrit-wm> (CR) Giuseppe Lavagetto: [C: ] switchover: make jobrunners in codfw start up [puppet] - https://gerrit.wikimedia.org/r/284196 (owner: Ori.livneh)
2016-04-19 14:59:43 <godog> MatmaRex: yeah sth like this I think? https://phabricator.wikimedia.org/T131769
2016-04-19 14:59:47 <MatmaRex> godog: same problem as https://phabricator.wikimedia.org/T132921 ? that was filed before the switchover
2016-04-19 14:59:59 <MatmaRex> huh. more dupes
2016-04-19 15:00:07 <godog> :(
2016-04-19 15:00:10 <wikibugs> Operations, Commons, MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218083 (matmarex)
2016-04-19 15:00:12 <jynus> !log applying database weight changes
2016-04-19 15:00:16 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:00:27 <logmsgbot> !log jynus@tin Synchronized wmf-config/db-codfw.php: db weight tweaking to better process the load (duration: 00m 28s)
2016-04-19 15:00:31 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:00:32 <ori> jynus: can I go ahead?
2016-04-19 15:00:37 <paravoid> jynus: shall we move on with #13/#14 or are you still investigating?
2016-04-19 15:00:41 <jynus> ori, paravoid please do ahead
2016-04-19 15:00:43 <_joe_> ok
2016-04-19 15:00:50 <grrrit-wm> (CR) Ori.livneh: [C: 2 V: 2] switchover: make jobrunners in codfw start up [puppet] - https://gerrit.wikimedia.org/r/284196 (owner: Ori.livneh)
2016-04-19 15:00:56 <volans> jynus: s4, s6, es2 and es3 have still alerts on icinga for replication lag for codfw
2016-04-19 15:00:59 <_joe_> ori: I'll puppet-merge
2016-04-19 15:01:11 <jynus> volans, checking
2016-04-19 15:01:15 <ori> _joe_: ack
2016-04-19 15:01:16 <Krenair> wikis are still user-visibly broken
2016-04-19 15:01:23 <paravoid> Krenair: how so?
2016-04-19 15:01:27 <Krenair> paravoid, no RC
2016-04-19 15:01:33 <mark> yeah ok
2016-04-19 15:01:47 <_joe_> ori: merged
2016-04-19 15:02:19 <bblack> PURGE ramping in as expected from jobrunners running already
2016-04-19 15:02:46 <_joe_> !log [switchover #13] starting maintenance jobs
2016-04-19 15:02:50 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:03:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: es3 on es1019 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
2016-04-19 15:03:21 <icinga-wm> RECOVERY - MariaDB Slave Lag: es3 on es2018 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
2016-04-19 15:03:22 <grrrit-wm> (PS1) Chad: Use legacy key exchanges on yurud, like antimony [puppet] - https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718)
2016-04-19 15:03:32 <bblack> (not at full usual volume yet, but moving the right direction)
2016-04-19 15:03:39 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 2.08 seconds
2016-04-19 15:04:01 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db1037 is OK: OK slave_sql_lag Replication lag: 1.13 seconds
2016-04-19 15:04:03 <paravoid> volans, jynus I see a bunch of "Deadlock found when trying to get lock; try restarting transaction (10.192.0.12)" from ContentTranslation on the DB error log
2016-04-19 15:04:12 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db1042 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
2016-04-19 15:04:12 <icinga-wm> RECOVERY - MariaDB Slave Lag: es3 on es1017 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
2016-04-19 15:04:15 <addshore> MatmaRex: those tasks you linked re deleting files are different from what is happening now
2016-04-19 15:04:21 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db1068 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
2016-04-19 15:04:21 <icinga-wm> RECOVERY - MariaDB Slave Lag: es3 on es1014 is OK: OK slave_sql_lag Replication lag: 0.38 seconds
2016-04-19 15:04:21 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db1022 is OK: OK slave_sql_lag Replication lag: 0.19 seconds
2016-04-19 15:04:30 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db2028 is OK: OK slave_sql_lag Replication lag: 0.29 seconds
2016-04-19 15:04:33 <wikibugs> Operations, Commons, MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (Bawolff) >"Unable to delete file pages on commons: "Could not acquire lock"". Locks use redis, probably related to switchover.
2016-04-19 15:04:33 <jynus> paravoid, that is "normal"
2016-04-19 15:04:37 <paravoid> alright
2016-04-19 15:04:39 <ori> I still don't see edits in RC, and I am not sure I understand why.
2016-04-19 15:04:42 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db1030 is OK: OK slave_sql_lag Replication lag: 0.37 seconds
2016-04-19 15:04:50 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
2016-04-19 15:04:53 <mark> indeed
2016-04-19 15:05:00 <jynus> ok, puppet was stuck, probably my fault
2016-04-19 15:05:09 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.09 seconds
2016-04-19 15:05:18 <icinga-wm> RECOVERY - MariaDB Slave Lag: es3 on es2017 is OK: OK slave_sql_lag Replication lag: 0.63 seconds
2016-04-19 15:05:19 <mark> does the same hold for stream.wikimedia.org etc?
2016-04-19 15:05:19 <jynus> ori, it could be one of those dbs that handle recent changes
2016-04-19 15:05:23 <Krenair> yes mark
2016-04-19 15:05:28 <jynus> which wiki, ori?
2016-04-19 15:05:33 <Krenair> Krinkle, FYI, RC is down but editing is allowed
2016-04-19 15:05:35 <ori> en (https://en.wikipedia.org/wiki/Special:RecentChanges)
2016-04-19 15:05:38 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db1059 is OK: OK slave_sql_lag Replication lag: 0.04 seconds
2016-04-19 15:05:40 <_joe_> ori: we're not enqueueing jobs it seems https://grafana.wikimedia.org/dashboard/db/job-queue-health
2016-04-19 15:05:46 <wikibugs> Operations, Commons, MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (Addshore) >>! In T132921#2218096, @Bawolff wrote: >>"Unable to delete file pages on commons: "Could not acquire lock"". > > Locks use re...
2016-04-19 15:05:47 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
2016-04-19 15:05:55 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.45 seconds
2016-04-19 15:06:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.49 seconds
2016-04-19 15:06:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db1056 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
2016-04-19 15:06:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db1050 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
2016-04-19 15:06:03 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.28 seconds
2016-04-19 15:06:18 <wikibugs> Operations, Commons, MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218103 (matmarex) Switchover is today, these go as far back as April 4 (T131769, should be duped).
2016-04-19 15:06:22 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.21 seconds
2016-04-19 15:06:23 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Replication lag: 0.46 seconds
2016-04-19 15:06:26 <_joe_> but well, that graph is clearly broken
2016-04-19 15:06:31 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.45 seconds
2016-04-19 15:06:42 <_joe_> so there is some problem with locks on redis for images?
2016-04-19 15:06:42 <ori> AaronSchulz: are you around?
2016-04-19 15:06:48 <_joe_> yeah we need aaron
2016-04-19 15:06:52 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db1061 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
2016-04-19 15:06:52 <icinga-wm> RECOVERY - MariaDB Slave Lag: s4 on db1064 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
2016-04-19 15:06:53 <MatmaRex> _joe_: yes, but unrelated to switchover
2016-04-19 15:06:55 <bblack> well, my PURGE volume still isn't up to speed either, which is also jobq-driven mostly. it started to ramp up, but then didn't really
2016-04-19 15:07:00 <godog> _joe_: looks like it predates the switchover
2016-04-19 15:07:00 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
2016-04-19 15:07:05 <bd808> hhvm logs are full of redis connection errors -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm
2016-04-19 15:07:08 <icinga-wm> RECOVERY - MariaDB Slave Lag: es3 on es2019 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
2016-04-19 15:07:09 <MatmaRex> _joe_: but the JobQueueErrors are related to switchover
2016-04-19 15:07:12 <MatmaRex> separate issues
2016-04-19 15:07:29 <_joe_> rdb2005.eqiad?
2016-04-19 15:07:30 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
2016-04-19 15:07:32 <_joe_> who did that
2016-04-19 15:07:35 <_joe_> me probably
2016-04-19 15:07:36 <_joe_> idiot
2016-04-19 15:07:39 <icinga-wm> RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
2016-04-19 15:07:40 <paravoid> haha
2016-04-19 15:07:51 <paravoid> yyup
2016-04-19 15:08:00 <paravoid> $wmfAllServices['codfw']['jobqueue_redis'] = array(
2016-04-19 15:08:02 <paravoid> is all broken
2016-04-19 15:08:10 <_joe_> paravoid: fixing now
2016-04-19 15:08:13 <paravoid> alright
2016-04-19 15:08:20 <icinga-wm> RECOVERY - MariaDB Slave Lag: es2 on es2016 is OK: OK slave_sql_lag Replication lag: 0.49 seconds
2016-04-19 15:08:31 <icinga-wm> RECOVERY - MariaDB Slave Lag: es2 on es1011 is OK: OK slave_sql_lag Replication lag: 0.14 seconds
2016-04-19 15:08:42 <paravoid> _joe_: $wmfAllServices['codfw']['jobqueue_aggregator'] = array(
2016-04-19 15:08:44 <paravoid> too
2016-04-19 15:09:07 <jynus> volans, what priorities do you see regarding dbs?
2016-04-19 15:09:11 <icinga-wm> RECOVERY - MariaDB Slave Lag: es2 on es1015 is OK: OK slave_sql_lag Replication lag: 0.02 seconds
2016-04-19 15:09:29 <icinga-wm> RECOVERY - MariaDB Slave Lag: es2 on es2015 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
2016-04-19 15:09:34 <grrrit-wm> (PS1) Giuseppe Lavagetto: Fix codfw redis hostnames [mediawiki-config] - https://gerrit.wikimedia.org/r/284201
2016-04-19 15:09:35 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218112 (Gehel)
2016-04-19 15:09:43 <_joe_> paravoid: ^^ a quick look?
2016-04-19 15:09:49 <icinga-wm> RECOVERY - MariaDB Slave Lag: es2 on es2014 is OK: OK slave_sql_lag Replication lag: 0.36 seconds
2016-04-19 15:09:50 <icinga-wm> RECOVERY - MariaDB Slave Lag: es2 on es1013 is OK: OK slave_sql_lag Replication lag: 0.41 seconds
2016-04-19 15:09:59 <paravoid> looks good
2016-04-19 15:10:03 <grrrit-wm> (CR) BBlack: [C: ] Fix codfw redis hostnames [mediawiki-config] - https://gerrit.wikimedia.org/r/284201 (owner: Giuseppe Lavagetto)
2016-04-19 15:10:07 <ori> already syncing live-hacked fix
2016-04-19 15:10:14 <grrrit-wm> (CR) Giuseppe Lavagetto: [C: 2] Fix codfw redis hostnames [mediawiki-config] - https://gerrit.wikimedia.org/r/284201 (owner: Giuseppe Lavagetto)
2016-04-19 15:10:16 <Krenair> You fixed RC, ori?
2016-04-19 15:10:16 <ori> identical to change
2016-04-19 15:10:17 <_joe_> ori: aha
2016-04-19 15:10:18 <logmsgbot> !log ori@tin Synchronized wmf-config/ProductionServices.php: live-hack fix for rdb2*.eqiad (duration: 00m 34s)
2016-04-19 15:10:22 <_joe_> Krenair: yes
2016-04-19 15:10:22 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:10:27 <volans> jynus: a tail of all error logs for anomalies, checking tendril for performances and loads on the hosts, keeping an eye on icinga
2016-04-19 15:10:30 <ori> rc is back
2016-04-19 15:10:32 <ori> purges should be too
2016-04-19 15:10:36 <paravoid> gehel: the wdqs breakage is probably related to the RC breakage
2016-04-19 15:10:38 <Krenair> confirmed
2016-04-19 15:10:46 <jynus> volans, I think x1 is a hot spot, there are multiple replication breakages
2016-04-19 15:10:53 <_joe_> I had several reviewers here....
2016-04-19 15:10:56 <_joe_> bd808: thanks
2016-04-19 15:11:13 <addshore> oooh, got this interesting one, may be related to the switchover, https://www.wikidata.org/wiki/Q23889824 Exception encountered, of type "BadMethodCallException"
2016-04-19 15:11:15 <bblack> FWIW: git grep -E '[a-z]+2[0-9]+\.eqiad' on mediawiki-config says no such other errant hostnames
2016-04-19 15:11:25 <mark> we should get a CI check for "hostname2xx.eqiad" etc
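A check of the kind mark suggests could mirror bblack's grep: WMF server numbering puts 1xxx hosts in eqiad and 2xxx hosts in codfw, so a cross-labelled name like "rdb2005.eqiad" is almost certainly a typo. A hypothetical sketch in Python (the real CI job was filed as T133047 and may well differ):

```python
import re

# Mirrors bblack's grep above: 2xxx hosts belong in codfw and 1xxx hosts
# in eqiad, so any cross-labelled hostname is flagged as bogus.
BOGUS_HOSTNAME = re.compile(r'\b[a-z]+2[0-9]{3}\.eqiad\b|\b[a-z]+1[0-9]{3}\.codfw\b')

def find_bogus_hostnames(text):
    """Return every errant datacenter-suffixed hostname found in text."""
    return BOGUS_HOSTNAME.findall(text)
```

Run over a config tree, `find_bogus_hostnames("rdb2005.eqiad.wmnet")` would flag `rdb2005.eqiad`, while correctly-suffixed names like `db1019.eqiad.wmnet` pass clean.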
2016-04-19 15:11:30 <volans> I was looking at it just now from icinga
2016-04-19 15:11:36 <jynus> volans, I think most of the current alerts are due to the old masters
2016-04-19 15:11:38 <_joe_> bd808: what is Warning: Cannot modify header information - headers already sent in /srv/me...
2016-04-19 15:11:44 <_joe_> there are a ton of those too
2016-04-19 15:11:48 <_joe_> they sound dangerous
2016-04-19 15:11:49 <moritzm> argon is also back broadcasting rc changes
2016-04-19 15:11:52 <ostriches> mark: Should be easy, filing a task.
2016-04-19 15:11:54 <andrewbogott> !log testing the log by logging a test
2016-04-19 15:11:59 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:12:04 <_joe_> like someone leaving a blank space somewhere
2016-04-19 15:12:08 <gehel> paravoid: thanks, I'm still trying to understand where the wdqs updates come from... but it seems that wdqs-updater does polling, not RCStream...
2016-04-19 15:12:11 <bd808> _joe_: something trying to set a cookie late I would guess, but let me look
2016-04-19 15:12:13 <Krenair> ori, so is RC going to be repopulated?
2016-04-19 15:12:17 <jynus> trying flow
2016-04-19 15:12:26 <volans> jynus: apart x1 yes, all the alarms are on eqiad for DBs
2016-04-19 15:12:32 <ori> Krenair: probably not
2016-04-19 15:12:47 <ori> I am aware that that is an issue
2016-04-19 15:12:52 <Krenair> ori, so what are we going to do about any vandalism occurring between editing being reallowed and RC being fixed?
2016-04-19 15:12:54 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218144 (Gehel) Might be related to RCStream breaking during the switch.
2016-04-19 15:13:11 <_joe_> Krenair: we're still looking at things now
2016-04-19 15:13:17 <ori> come across it and fix it
2016-04-19 15:13:22 <jynus> volans, due to puppet delay, pt-heartbeat was not executed properly just after the kill, but that did not cause user problems this time
2016-04-19 15:13:25 <mobrovac> it seems RB started getting some reqs from the job queue
2016-04-19 15:13:29 <_joe_> can we leave discussions for later, please?
2016-04-19 15:13:32 <ori> yeah, jq is back
2016-04-19 15:13:33 <bblack> there was always going to be such a window in the procedure, it just went a little longer than expected
2016-04-19 15:13:38 <_joe_> mobrovac: yeah the problem was the misconfiguration
2016-04-19 15:13:44 <mobrovac> kk
2016-04-19 15:13:47 <_joe_> bblack: not longer than I expected
2016-04-19 15:13:51 <Keegan> I have successfully deleted spam
2016-04-19 15:13:52 <AaronSchulz> reads backscroll
2016-04-19 15:13:55 <Keegan> Great work!!!!!
2016-04-19 15:14:07 <paravoid> so, done with MW?
2016-04-19 15:14:11 <paravoid> shall we move on to traffic?
2016-04-19 15:14:16 <_joe_> paravoid: I think so yes
2016-04-19 15:14:17 <wikibugs> Operations, Continuous-Integration-Config, Release-Engineering-Team: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2218145 (demon)
2016-04-19 15:14:21 <bd808> _joe_: the function that is erroring is WebResponse::header(). Will need to find stacktrace to figure out what header is trying to go out.
2016-04-19 15:14:24 <bblack> do we want to further confirm no need to rollback on any MW before moving traffic?
2016-04-19 15:14:33 <mark> no need to rush into traffic I think?
2016-04-19 15:14:37 <_joe_> bd808: I think it stopped
2016-04-19 15:14:44 <ori> AaronSchulz: tl;dr: codfw aggregators were "rdb200x.eqiad" instead of "rdb200x.codfw", just a typo. but if there's a way to generate RC events for revisions created during that time that would be good.
2016-04-19 15:15:00 <Krenair> there is rebuildrecentchanges.php
2016-04-19 15:15:09 <jynus> I would like to test Recent changes on several wikis
2016-04-19 15:15:15 <Krenair> it takes several hours
2016-04-19 15:15:18 <_joe_> AaronSchulz: and you did review that change :P
2016-04-19 15:15:23 <jynus> it is usually a cause of problems for my service
2016-04-19 15:15:24 <_joe_> so it's on me and on you
2016-04-19 15:15:59 <bd808> _joe_: *nod* it could have been a result of the redis problem. If MW was trying to send an error response after it had already spit out the page that error would happen
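The failure mode bd808 describes is generic to any HTTP stack: once the first body byte has been flushed, the status line and headers are already on the wire, so a late attempt to set a header can only produce a warning. A toy illustration in Python (an illustrative sketch, not MediaWiki's actual WebResponse class):

```python
import io

class Response:
    """Toy response object: headers may only change before the body flushes."""
    def __init__(self):
        self.headers = {}
        self.body = io.BytesIO()
        self.headers_sent = False

    def header(self, name, value):
        if self.headers_sent:
            # Analogous to PHP's "Cannot modify header information -
            # headers already sent" warning seen in the hhvm logs.
            raise RuntimeError("headers already sent")
        self.headers[name] = value

    def write(self, data):
        self.headers_sent = True  # first body byte freezes the headers
        self.body.write(data)
```

An error path that tries to emit a cookie after the page has already been streamed (as MW did while redis was misconfigured) hits exactly this guard.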
2016-04-19 15:16:10 <_joe_> bd808: I think it is, yes
2016-04-19 15:16:12 <ori> it's not on anyone, blame isn't useful or nice. we just need more safeguards for that next time
2016-04-19 15:16:13 <bblack> to be clear, the discussed traffic steps are in https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Specifics_for_Switchover_Test_Week - this moves user routing so the last-mile isn't eqiad-varnish -> codfw-apps, and gets users off of eqiad frontends too.
2016-04-19 15:16:27 <_joe_> ori: I'm just joking, I thought it was clear
2016-04-19 15:16:32 <ori> ok :)
2016-04-19 15:16:47 <bd808> I saw the pile of errors and "eqiad" didn't jump out at me either
2016-04-19 15:16:56 <ori> bblack: I think that's good to go from MW's perspective, I can't imagine we'd roll back now
2016-04-19 15:16:57 <_joe_> the :P was just to avoid that kind of confusion :)
2016-04-19 15:17:11 <MatmaRex> Krenair: ori: rebuildrecentchanges starts with a DELETE * FROM, though.
2016-04-19 15:17:13 <_joe_> ori: let's wait for everyone to confirm we're ok
2016-04-19 15:17:22 <jynus> volans, I think you may have killed heartbeat on db1001, which is misc and not part of the failover, can you confirm (if not, it is a bug)
2016-04-19 15:17:29 <MatmaRex> and it's kind of lame. loses patrolling information, for example
2016-04-19 15:17:35 <volans> jynus: double checking
2016-04-19 15:17:37 <jynus> (not an issue, either)
2016-04-19 15:17:38 <MatmaRex> so just running it might do a disservice
2016-04-19 15:17:40 <ori> yeah, I wouldn't risk it
2016-04-19 15:17:50 <godog> edit.success in graphite now close to pre-switchover levels
2016-04-19 15:17:52 <ori> deletes won't make jynus happy either
2016-04-19 15:17:53 <MatmaRex> i agree it would be good to repopulate if possible. but we'd need to tweak it
2016-04-19 15:17:58 <jynus> volans, if you killed it, it is good news
2016-04-19 15:17:58 <_joe_> btw, we've dropped the jobqueue repeatedly in the past
2016-04-19 15:18:04 <ori> godog: great idea to check that, thanks
2016-04-19 15:18:09 <MatmaRex> (i guess to just rebuild in the affected time range, rather than all)
2016-04-19 15:18:21 <volans> jynus: no I didn't
2016-04-19 15:18:23 <_joe_> as in, it crashed and the redis db was corrupted
2016-04-19 15:18:25 <subbu> parsoid codfw cluster is slowly getting back to "normal" from 0% .. courtesy jobrunners. grafana oldid wt2html rates going back up to previous levels. verified ve edit rc stream on enwiki and itwiki .. so, all good from the parsoid side.
2016-04-19 15:18:26 <jynus> mmm
2016-04-19 15:18:26 <_joe_> before ori fixed it
2016-04-19 15:18:27 <volans> sorry
2016-04-19 15:18:34 <jynus> then puppet killed it :-/
2016-04-19 15:18:39 <paravoid> subbu: thanks subbu
2016-04-19 15:18:46 <_joe_> subbu: great!
2016-04-19 15:19:04 <godog> ori: np, also added a learning/question on why it never dropped to zero
2016-04-19 15:19:20 <paravoid> ori, bblack: the only reason that I can think of for holding traffic a little while longer would be performance metrics
2016-04-19 15:19:27 <ori> statsd buffers
2016-04-19 15:19:28 <paravoid> as in, if we want to measure the two events independently
2016-04-19 15:20:19 <bblack> I just want to be sure there's no lingering issues with the final bits (maint/jq) that are going to cause us to want to undo the switch
2016-04-19 15:20:22 <_joe_> the queue became way larger, we might need to check it if it doesn't reduce in reasonable times
2016-04-19 15:20:24 <bblack> if we're confident on that, I'm ok
2016-04-19 15:20:28 <ori> i went into this with the mindframe that performance takes a back seat to correctness / availability, so my preference would be to stick with the process and run separate controlled experiments if we want to find out more about the impact of these routing changes
2016-04-19 15:20:30 <paravoid> I doubt we'd undo at this point
2016-04-19 15:20:37 <_joe_> bblack: as far as maint is concerned, there is no issue
2016-04-19 15:20:47 <bblack> ok
2016-04-19 15:21:03 <ori> i'll be back in <5m
2016-04-19 15:21:06 <_joe_> jobqueue seems ok to me, but I'd like confirmation on changes
2016-04-19 15:21:14 <jynus> I am currently happy, given some issue, the only reason I would rollback now is if there were issues with redis/swift
2016-04-19 15:21:16 <_joe_> honestly, I'd wait for wikidata sync to recover
2016-04-19 15:21:52 <addshore> wikidata dispatch lag is heading down now :)
2016-04-19 15:22:20 <bblack> analytics1052 has lots of issues in icinga, may be failing, probably unrelated?
2016-04-19 15:22:23 <icinga-wm> RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1677 bytes in 0.194 second response time
2016-04-19 15:22:36 <_joe_> oh shit
2016-04-19 15:22:46 <bblack> and mira still has unmerged changes in mediawiki_config
2016-04-19 15:22:49 <_joe_> did we merge wait
2016-04-19 15:22:52 <icinga-wm> PROBLEM - configured eth on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:23:00 <mark> _joe_: elaborate?
2016-04-19 15:23:07 <_joe_> sorry false alarm
2016-04-19 15:23:11 <icinga-wm> PROBLEM - Check size of conntrack table on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:23:11 <icinga-wm> PROBLEM - Disk space on Hadoop worker on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:23:11 <icinga-wm> PROBLEM - puppet last run on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:23:12 <ori> i merged on tin but did not sync because it's a no-op, identical to what i deployed before
2016-04-19 15:23:19 <ori> i'll sync it anyway to quiet mira
2016-04-19 15:23:22 <icinga-wm> PROBLEM - DPKG on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:23:35 <_joe_> I could not find trace of $wmfMasterDatacenter = 'codfw'; in grepping mediawiki-config
2016-04-19 15:23:38 <icinga-wm> PROBLEM - Hadoop JournalNode on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:23:41 <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:23:43 <_joe_> and I was just on the wrong branch
2016-04-19 15:23:53 <icinga-wm> PROBLEM - salt-minion processes on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:24:01 <icinga-wm> PROBLEM - Hadoop DataNode on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:24:01 <icinga-wm> PROBLEM - RAID on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:24:10 <bblack> while I'm running through icingas: pc2005 has a disk space warning since ~2h ago
2016-04-19 15:24:25 <akosiaris> I wonder if analytics1052 is related to the wake of the switchover
2016-04-19 15:24:43 <icinga-wm> PROBLEM - Disk space on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:24:51 <icinga-wm> PROBLEM - dhclient process on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:24:53 <icinga-wm> PROBLEM - YARN NodeManager Node-State on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 15:25:03 <grrrit-wm> (PS1) Elukey: Add the possibility to set an external database for Hue. [puppet/cdh] - https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990)
2016-04-19 15:25:06 <logmsgbot> !log ori@tin Synchronized wmf-config/ProductionServices.php: Iee2e08df5: Fix codfw redis hostnames [no-op, already synced as live hack] (duration: 00m 36s)
2016-04-19 15:25:09 <volans> bblack: known, no problem there, thanks (in the sense that we'll clean up some stuff later)
2016-04-19 15:25:10 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:25:31 <elukey> checking analytics1052
2016-04-19 15:26:15 <bblack> paravoid: ok so, move on traffic?
2016-04-19 15:26:43 <icinga-wm> RECOVERY - Disk space on analytics1052 is OK: DISK OK
2016-04-19 15:26:52 <icinga-wm> RECOVERY - dhclient process on analytics1052 is OK: PROCS OK: 0 processes with command name dhclient
2016-04-19 15:27:01 <icinga-wm> RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING
2016-04-19 15:27:03 <elukey> didn't do anything, wasn't able to ssh, temp glitch?
2016-04-19 15:27:12 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218221 (Addshore)
2016-04-19 15:27:13 <icinga-wm> RECOVERY - configured eth on analytics1052 is OK: OK - interfaces up
2016-04-19 15:27:30 <paravoid> bblack: yes please
2016-04-19 15:27:31 <_joe_> paravoid, bblack I'd just love to see the wikidata alert come back
2016-04-19 15:27:32 <icinga-wm> RECOVERY - Disk space on Hadoop worker on analytics1052 is OK: DISK OK
2016-04-19 15:27:32 <icinga-wm> RECOVERY - Check size of conntrack table on analytics1052 is OK: OK: nf_conntrack is 0 % full
2016-04-19 15:27:32 <icinga-wm> RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
2016-04-19 15:27:42 <_joe_> but nothing to interfere with traffic
2016-04-19 15:27:42 <icinga-wm> RECOVERY - DPKG on analytics1052 is OK: All packages OK
2016-04-19 15:27:55 <ottomata> weird
2016-04-19 15:28:02 <icinga-wm> RECOVERY - Hadoop JournalNode on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode
2016-04-19 15:28:13 <icinga-wm> RECOVERY - Hadoop NodeManager on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
2016-04-19 15:28:14 <bblack> _joe_: wikidata lag alert did go away
2016-04-19 15:28:21 <_joe_> bblack: yeah just saw
2016-04-19 15:28:22 <icinga-wm> RECOVERY - salt-minion processes on analytics1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
2016-04-19 15:28:23 <icinga-wm> RECOVERY - Hadoop DataNode on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
2016-04-19 15:28:23 <icinga-wm> RECOVERY - RAID on analytics1052 is OK: OK: optimal, 13 logical, 14 physical
2016-04-19 15:28:32 <_joe_> gehel: wdqs still not updating?
2016-04-19 15:28:36 <bblack> elukey: analytics1052 had very high iowait in ganglia while it was dead in icinga
2016-04-19 15:28:38 <gehel> _joe_, bblack: I acked the WDQS alert
2016-04-19 15:28:39 <_joe_> I guess it should be ok now
2016-04-19 15:28:41 <ottomata> looks like icinga glitch for an52? uptime 34 days
2016-04-19 15:28:53 <ori> hacker news is downish, silicon valley in crisis
2016-04-19 15:28:57 <gehel> but yes, it is back to normal
2016-04-19 15:29:20 <akosiaris> ottomata: nope. look at https://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&h=analytics1052.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2
2016-04-19 15:29:22 <_joe_> ori: we'll discover they were hosting their website on the rc feeds
2016-04-19 15:29:22 <grrrit-wm> (PS3) BBlack: codfw switch: codfw text caches -> direct [puppet] - https://gerrit.wikimedia.org/r/283430
2016-04-19 15:29:30 <akosiaris> there was clearly something going on
2016-04-19 15:30:08 <wikibugs> Operations, ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2218234 (Papaul) |**server name **|**rack location **| |restbase2001|B5| |restbase2002|B8| |restbase2003|C1| |restbase2004|C5| |restbase2005|D1| |restbase2006|D5| layout option |**server name **|**rack...
2016-04-19 15:30:20 <volans> jynus: for x1 how you want to proceed? we have ROW binlog format on db2009
2016-04-19 15:30:22 <grrrit-wm> (CR) BBlack: [C: 2 V: 2] codfw switch: codfw text caches -> direct [puppet] - https://gerrit.wikimedia.org/r/283430 (owner: BBlack)
2016-04-19 15:30:27 <bblack> !log [traffic codfw switch #1] - puppet merging text caches -> direct
2016-04-19 15:30:32 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:30:42 <jynus> the within-codfw replication is working, right?
2016-04-19 15:30:51 <volans> yes
2016-04-19 15:31:04 <bblack> !log [traffic codfw switch #1] - salting puppet change
2016-04-19 15:31:08 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:31:20 <jynus> volans, because at this point, I would upgrade the old master (aka failover to a 10 slave)
2016-04-19 15:31:53 <jynus> let's make a plan of things we have to do in these 48 hours
2016-04-19 15:32:03 <jynus> but this is oftopic from here
2016-04-19 15:32:32 <volans> yes we can continue on -databases
2016-04-19 15:32:34 <jynus> it should be part of the eqiad maintenance
2016-04-19 15:32:48 <jynus> let me first give a general check to codfw
2016-04-19 15:33:08 <jynus> and maybe taking a break when things stabilize
2016-04-19 15:33:23 <volans> sure
2016-04-19 15:34:05 <grrrit-wm> (PS3) BBlack: codfw switch: geodns depool text services from eqiad [dns] - https://gerrit.wikimedia.org/r/283433
2016-04-19 15:34:14 <bblack> !log [traffic codfw switch #1] - puppet change complete - done
2016-04-19 15:34:18 <jynus> sorry I was a bit disconnected, what is the current status, mediawiki ok, pending traffic?
2016-04-19 15:34:18 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:34:37 <ori> jynus: correct
2016-04-19 15:34:52 <icinga-wm> RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
2016-04-19 15:35:10 <grrrit-wm> (CR) BBlack: [C: 2] codfw switch: geodns depool text services from eqiad [dns] - https://gerrit.wikimedia.org/r/283433 (owner: BBlack)
2016-04-19 15:35:11 <_joe_> gehel: :)
2016-04-19 15:35:54 <bblack> !log [traffic codfw switch #2] - authdns-update complete, user traffic to eqiad frontends should start dropping off now
2016-04-19 15:35:58 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:36:00 <gehel> _joe_: I have to admit I did not understand everything...
2016-04-19 15:36:15 <grrrit-wm> (PS3) BBlack: codfw switch: esams text caches -> codfw [puppet] - https://gerrit.wikimedia.org/r/283431
2016-04-19 15:36:19 <_joe_> gehel: it's probably fed by the jobqueue or a maintenance script
2016-04-19 15:36:27 <_joe_> which we stopped during the switchover
2016-04-19 15:36:41 <addshore> gehel: _joe_ yeh I think that is the case (if you're talking about WDQS)
2016-04-19 15:36:45 <gehel> _joe_: looking at the code, it seems to call api.php ...
2016-04-19 15:36:46 <wikibugs> Operations, Commons, MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218251 (matmarex)
2016-04-19 15:36:53 <paravoid> bblack: shoot if you need more eyes and/or hands :)
2016-04-19 15:37:03 <_joe_> yeah that ^^
2016-04-19 15:37:05 <_joe_> :)
2016-04-19 15:37:09 <addshore> gehel: oohh, what in api.php? ;)
2016-04-19 15:37:30 <bblack> so far all smooth - keep in mind all of this is pre-tested for other clusters/scenarios :)
2016-04-19 15:37:54 <gehel> addshore: something similar to curl -v -s https://www.wikidata.org/w/api.php?format=json\&action=query\&list=recentchanges\&rcdir=newer\&rcprop=title\|ids\|timestamp\&rclimit=10\&rcstart=20160404000000
2016-04-19 15:38:07 <grrrit-wm> (CR) BBlack: [C: 2 V: 2] codfw switch: esams text caches -> codfw [puppet] - https://gerrit.wikimedia.org/r/283431 (owner: BBlack)
2016-04-19 15:38:25 <wikibugs> Operations, Commons, MediaWiki-Page-deletion, media-storage: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (matmarex)
2016-04-19 15:38:32 <addshore> oooh, _joe_ recentchanges wouldn't actually be populated for the period it was broken right?
2016-04-19 15:38:35 <bblack> !log [traffic codfw switch #3] - puppet merging esams text -> codfw
2016-04-19 15:38:39 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:39:36 <bblack> !log [traffic codfw switch #3] - salting puppet change
2016-04-19 15:39:37 <jynus> volans, I am going to update the masters on tendril to get a better picture (it is a row in the db1011 database)
2016-04-19 15:39:40 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:40:00 <addshore> in fact, per one of my test edits on test.wikipedia.org not appearing in recentchanges I'll say no, which means WDQS is going to be missing a handful of edits now gehel!
2016-04-19 15:40:02 <volans> jynus: ok
2016-04-19 15:40:31 <addshore> if it actually gets the changes from recentchanges and the api in that way
2016-04-19 15:41:32 <gehel> addshore: not sure I read the code correctly, I'll check with SMalyshev when he arrives
2016-04-19 15:41:38 <bblack> !log [traffic codfw switch #3] - puppet change complete - done
2016-04-19 15:41:42 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:42:00 <addshore> gehel: cool, yeh I wouldn't remember how it works without diving through the code again
2016-04-19 15:42:13 <mark> bblack: all done?
2016-04-19 15:42:23 <godog> still seeing some traffic on eqiad imagescalers, possibly related to swift proxy processes still running after reload
2016-04-19 15:42:31 <grrrit-wm> (PS3) BBlack: codfw switch: eqiad text caches -> codfw [puppet] - https://gerrit.wikimedia.org/r/283432
2016-04-19 15:43:01 <bblack> mark: in practice yes, but still waiting for (a) eqiad frontend users to finish draining out from DNS TTL
2016-04-19 15:43:06 <mark> ok
2016-04-19 15:43:17 <bblack> + (b) reconfirming #1 is definitely done before #4, so we don't cause loops
2016-04-19 15:43:20 <wikibugs> Operations, Commons, MediaWiki-Page-deletion, media-storage: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218261 (matmarex) I can't get the full backtraces from logstash (they're truncated there), but all of these exceptions are "Co...
2016-04-19 15:43:35 <bblack> but #4 is just to catch users stuck in eqiad with bad DNS, doesn't affect much load/traffic in practice
2016-04-19 15:43:54 <mark> i'll start preparing an update mail to be sent out
2016-04-19 15:45:15 <bblack> #1 confirmed
2016-04-19 15:45:32 <bblack> !log [traffic codfw switch #4] - puppet merging eqiad text -> codfw
2016-04-19 15:45:36 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:45:41 <grrrit-wm> (CR) BBlack: [C: 2 V: 2] codfw switch: eqiad text caches -> codfw [puppet] - https://gerrit.wikimedia.org/r/283432 (owner: BBlack)
2016-04-19 15:46:39 <bblack> !log [traffic codfw switch #4] - salting puppet change
2016-04-19 15:46:43 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:49:00 <bblack> !log [traffic codfw switch #2] - confirmed bulk of traffic moved after ~10min for DNS TTL, rates levelling out on eqiad+codfw front network stats
2016-04-19 15:49:05 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:49:21 <bblack> !log [traffic codfw switch #4] - puppet change complete - done
2016-04-19 15:49:25 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 15:49:42 <bblack> that's it for the traffic changes
2016-04-19 15:49:51 <paravoid> \o/
2016-04-19 15:50:34 <_joe_> \o/
2016-04-19 15:51:35 <volans> \o/
2016-04-19 15:51:57 <_joe_> we're not good at dancing, are we
2016-04-19 15:52:13 <urandom> looks at his feet
2016-04-19 15:52:18 <godog> ,o/ ,o/ ,o/
2016-04-19 15:52:43 <wikibugs> Operations, Wikidata, codfw-rollout: BadMethodCallException when viewing Items created around the time of the eqiad -> codfw switch - https://phabricator.wikimedia.org/T133048#2218157 (Addshore)
2016-04-19 15:52:45 <akosiaris> ahahahahah
2016-04-19 15:52:51 <akosiaris> travolta ftw
2016-04-19 15:52:55 <akosiaris> godog: good one!
2016-04-19 15:53:48 <godog> hahah thanks akosiaris \o.
2016-04-19 15:53:50 <akosiaris> I was going to /o/ (hey) \o\ (ho) /o/ (hey) \o\ (ho) but yours is better
2016-04-19 15:55:13 <_joe_> akosiaris: we know you can dance...
2016-04-19 15:55:13 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218336 (Gehel) Looking at the code, it seems that updates are fetched by using the MW API. Doing the call manually from wdqs1001...
2016-04-19 15:55:16 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218337 (matmarex)
2016-04-19 15:56:11 <_joe_> gehel: which url for the api is used by wdqs?
2016-04-19 15:56:15 <subbu> congratulations everyone on the successful dc switch. :)
2016-04-19 15:56:17 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218353 (Gehel) This is related to T133053.
2016-04-19 15:56:35 <wikibugs> Operations, MediaWiki-General-or-Unknown, Traffic, Wikimedia-General-or-Unknown, HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2218358 (BBlack)
2016-04-19 15:56:38 <wikibugs> Operations, Traffic, HTTPS, Patch-For-Review, Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2218356 (BBlack) Open>Resolved All Set-Cookie: emitted by varnish have the secure flag
2016-04-19 15:56:58 <gehel> _joe_: I reconstructed it from looking at the code (so I might be wrong) but it looks like ; curl -v -s https://www.wikidata.org/w/api.php?format=json\&action=query\&list=recentchanges\&rcdir=newer\&rcprop=title\|ids\|timestamp\&rclimit=100
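The poll gehel reconstructs amounts to a loop over `list=recentchanges`. A rough Python equivalent (parameter names taken from the curl above; the real wdqs-updater is a Java service and its logic may differ):

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def build_rc_url(rcstart=None, limit=100):
    """Build the recentchanges query gehel reconstructed above."""
    params = {
        "format": "json",
        "action": "query",
        "list": "recentchanges",
        "rcdir": "newer",              # oldest first
        "rcprop": "title|ids|timestamp",
        "rclimit": limit,
    }
    if rcstart:
        params["rcstart"] = rcstart    # e.g. "20160404000000"
    return API + "?" + urllib.parse.urlencode(params)

def fetch_recent_changes(rcstart=None, limit=100):
    # An edit that never reached the recentchanges table (as during the
    # window discussed here) is invisible to this kind of poller.
    with urllib.request.urlopen(build_rc_url(rcstart, limit)) as resp:
        return json.load(resp)["query"]["recentchanges"]
```

Because the updater only sees what recentchanges reports, the window in which RC was not populated translates directly into missing updates, which is why a full data reload is the clean fix.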
2016-04-19 15:57:13 <addshore> _joe_: gehel the query service is updating now https://grafana.wikimedia.org/dashboard/db/wikidata-query-service but if things were missing from RC then it will be missing data
2016-04-19 15:57:38 <_joe_> addshore: ack
2016-04-19 15:58:06 <_joe_> and thanks for looking :)
2016-04-19 15:58:22 <gehel> addshore: we need to reinstall one of the wdqs1001 server and do a full data load, so problem will be solved at that point for this server. For wdqs1002, we'll have to find a solution...
2016-04-19 15:58:36 <addshore> gehel: cool cool!
2016-04-19 15:58:47 <addshore> runs away to keep looking at https://phabricator.wikimedia.org/T133048
2016-04-19 15:59:31 <csteipp> Is there a tracking task / project for bugs related to the switch?
2016-04-19 16:00:01 <bblack> we should perhaps make one
2016-04-19 16:00:14 <bblack> for this particular date too, since there will be future switch tests
2016-04-19 16:00:57 <mark> i mailed one last week
2016-04-19 16:01:00 <mark> and again just now
2016-04-19 16:01:07 <mark> #codfw-rollout
2016-04-19 16:01:18 <bblack> ok
2016-04-19 16:01:22 <mark> it was deemed overkill to create a separate one
2016-04-19 16:01:32 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218390 (csteipp)
2016-04-19 16:03:22 <grrrit-wm> (CR) MGChecker: [C: -1] [WIP] Let Wikidata editors edit at a higher rate than on other wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/280003 (owner: Jforrester)
2016-04-19 16:06:52 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218396 (IKhitron) Well, Quarry ignores queries on the lost time in recentchanges table, but has all the data in revision table.
2016-04-19 16:06:55 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218397 (IKhitron) Well, Quarry ignores queries on the lost time in recentchanges table, but has all the data in revision table.
2016-04-19 16:08:46 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218323 (matmarex) There is a script called rebuildrecentchanges.php, but it would need some adjustments to work on a time range (right now it c...
2016-04-19 16:09:42 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218405 (IKhitron) So, it will be OK!
2016-04-19 16:14:02 <grrrit-wm> (CR) Dzahn: [C: -1] "it's called furud instead of yurud" [puppet] - https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) (owner: Chad)
2016-04-19 16:14:17 <grrrit-wm> (PS2) Chad: Use legacy key exchanges on furud, like antimony [puppet] - https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718)
2016-04-19 16:14:34 <grrrit-wm> (PS3) Dzahn: Use legacy key exchanges on furud, like antimony [puppet] - https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) (owner: Chad)
2016-04-19 16:14:40 <grrrit-wm> (Draft1) Addshore: Add ganglia link to codfw too [software/tendril] - https://gerrit.wikimedia.org/r/284184
2016-04-19 16:15:07 <grrrit-wm> (CR) Dzahn: [C: 2] Use legacy key exchanges on furud, like antimony [puppet] - https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) (owner: Chad)
2016-04-19 16:16:03 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218422 (Addshore) >>! In T133046#2218396, @IKhitron wrote: > Well, Quarry ignores queries on the lost time in recentchanges table...
2016-04-19 16:21:47 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218464 (Deskana)
2016-04-19 16:22:37 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218323 (Deskana) I've updated this task with some of the informati...
2016-04-19 16:27:18 <Krenair> is labs replication still working?
2016-04-19 16:28:20 <volans> Krenair: yes
2016-04-19 16:28:40 <Krenair> max(rc_timestamp) from enwiki_p.recentchanges is 20160419083717
2016-04-19 16:29:23 <Krenair> on labsdb1001
2016-04-19 16:29:50 <Krenair> on labsdb1003: 20160419162942
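The two `max(rc_timestamp)` values above are in MediaWiki's YYYYMMDDHHMMSS format and can be turned into a lag figure directly. A small sketch, assuming GNU date and the values pasted above:

```shell
ts_labsdb1001=20160419083717   # max(rc_timestamp) on labsdb1001, from above
ts_labsdb1003=20160419162942   # max(rc_timestamp) on labsdb1003, from above
# Convert a MediaWiki timestamp (YYYYMMDDHHMMSS) to epoch seconds (GNU date).
to_epoch() { date -u -d "${1:0:8} ${1:8:2}:${1:10:2}:${1:12:2}" +%s; }
lag=$(( $(to_epoch "$ts_labsdb1003") - $(to_epoch "$ts_labsdb1001") ))
echo "labsdb1001 is ${lag}s (~$(( lag / 3600 ))h) behind labsdb1003"
```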
2016-04-19 16:30:28 <volans> yes labsdb1001 is delayed
2016-04-19 16:30:37 <Krenair> ok
2016-04-19 16:32:07 <Luke081515> I was away a long time... Datacenter switch happened without big problems?
2016-04-19 16:32:14 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1023 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h
2016-04-19 16:32:14 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1033 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h
2016-04-19 16:32:14 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1038 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h
2016-04-19 16:32:14 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1040 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h
2016-04-19 16:32:14 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1052 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h
2016-04-19 16:32:14 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1058 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h
2016-04-19 16:33:08 <Krenair> Luke081515, well we lost a load of edits from RC
2016-04-19 16:33:52 <Luke081515> Krenair: :-/ I already saw that at the steward channel. Apart from that all ok?
2016-04-19 16:34:18 <addshore> Luke081515: mainly, so far :)
2016-04-19 16:34:40 <Luke081515> great :)
2016-04-19 16:35:45 <_joe_> Luke081515: overall it went quite well, I would say, yes
2016-04-19 16:36:21 <Luke081515> _joe_: Good. IIRC the next switch is tomorrow, or am I wrong?
2016-04-19 16:36:29 <akosiaris> thursday
2016-04-19 16:36:33 <_joe_> no, thursday
2016-04-19 16:36:36 <_joe_> in 46 hours
2016-04-19 16:36:49 <_joe_> I didn't realize it was so near :(
2016-04-19 16:37:05 <_joe_> no time to rest on laurels it seems
2016-04-19 16:37:22 <Luke081515> gah, I meant thursday, my fault ;)
2016-04-19 16:38:41 <mark> gehel: what's the remaining traffic on elasticsearch in eqiad?
2016-04-19 16:38:51 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1001 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check
2016-04-19 16:38:53 <ebernhardson> mark: indexing, and maybe ttmserver
2016-04-19 16:39:03 <mark> ok
2016-04-19 16:40:59 <Nikerabbit> ttmserver is in eqiad per earlier discussions
2016-04-19 16:42:02 <_joe_> definitely ttmserver
2016-04-19 16:42:03 <gehel> mark: as ebernhardson said. I did a check before the switch, I could see almost only indexing traffic
2016-04-19 16:42:09 <_joe_> Nikerabbit: indeed
2016-04-19 16:42:38 <gehel> ttmserver is still in eqiad but is mostly insignificant compared to indexing traffic
2016-04-19 16:43:58 <_joe_> of course
2016-04-19 16:45:22 <wikibugs> Operations, ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2218634 (fgiunchedi) ok, thanks @papaul ! if we have to co-locate in the same rack in row C that's fine too, I'll leave it to you whether C1 or C5
2016-04-19 16:45:34 <addshore> does anyone in here feel comfortable running a maint script for wikidata to clean up from the switchover? :P
2016-04-19 16:46:18 <_joe_> addshore: honestly, no :P but someone with more dev knowledge maybe
2016-04-19 16:46:37 <addshore> I probably would, but I dont have access ;)
2016-04-19 16:46:39 <_joe_> ori, AaronSchulz maybe?
2016-04-19 16:46:46 <_joe_> addshore: which script btw?
2016-04-19 16:46:51 <_joe_> I can take a look
2016-04-19 16:47:02 <addshore> see https://phabricator.wikimedia.org/T133048 rebuildEntityPerPage.php
2016-04-19 16:47:31 <icinga-wm> PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago
2016-04-19 16:48:57 <_joe_> addshore: that's a wikidata specific script, I have no idea how to use it
2016-04-19 16:48:57 <grrrit-wm> (PS1) Dzahn: statistics: rsync on stat1004 for stat1001 migration [puppet] - https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348)
2016-04-19 16:49:04 <_joe_> but if it runs on single pages
2016-04-19 16:49:11 <_joe_> we can test one for sure
2016-04-19 16:49:17 <AaronSchulz> knows nothing about that script
2016-04-19 16:49:19 <mutante> hmm.. what's the puppet issue with bast1001? .looking
2016-04-19 16:49:36 <_joe_> AaronSchulz: heh it's wikidata-specific
2016-04-19 16:49:42 <icinga-wm> RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 16:50:10 <mutante> eh... ok
2016-04-19 16:50:34 <_joe_> addshore: how is that script run?
2016-04-19 16:52:03 <addshore> _joe_: should be run as any other extension maint script is run
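For reference, extension maintenance scripts on the cluster are typically wrapped in `mwscript` with an explicit `--wiki`. A sketch only (the path to the script inside the deployed Wikibase tree is an assumption, and the command is printed rather than executed):

```shell
# Hypothetical invocation of the script from T133048; the extension-relative
# path is assumed, --wiki selects the target wiki.
cmd="mwscript extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki wikidatawiki"
echo "$cmd"   # dry run: print, don't execute
```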
2016-04-19 16:52:26 <_joe_> I see it repairs all of the table
2016-04-19 16:52:34 <grrrit-wm> (PS2) Dzahn: statistics: rsync on stat1004 for stat1001 migration [puppet] - https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348)
2016-04-19 16:52:37 <addshore> yup, but will only fill in the gaps
2016-04-19 16:52:50 <grrrit-wm> (PS3) Dzahn: statistics: rsync on stat1004 for stat1001 migration [puppet] - https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348)
2016-04-19 16:53:11 <grrrit-wm> (CR) Dzahn: [C: 2] statistics: rsync on stat1004 for stat1001 migration [puppet] - https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348) (owner: Dzahn)
2016-04-19 16:53:25 <grrrit-wm> (PS3) Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324)
2016-04-19 16:53:32 <_joe_> addshore: I'm honestly not that confident
2016-04-19 16:53:49 <addshore> thats fine, It can wait for someone else :)
2016-04-19 16:54:28 <addshore> may have to go look at what access he needs to run wikidata maint scripts...
2016-04-19 16:54:47 <grrrit-wm> (PS2) BBlack: varnish redir: wmfusercontent.org -> www.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/284112 (https://phabricator.wikimedia.org/T132452)
2016-04-19 16:55:13 <_joe_> addshore: do you have an account in prod?
2016-04-19 16:55:17 <addshore> yup
2016-04-19 16:55:54 <ostriches> !log restarting gerrit to pick up furud's rsa key
2016-04-19 16:55:58 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 16:56:24 <Krenair> addshore, to run maintenance scripts it's either restricted or deployment
2016-04-19 16:56:31 <wikibugs> Operations, ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2218678 (Papaul) Thanks @fgiunchedi here is the final layout |**server name **|**rack location**| |restbase2007|B1| |restbase2008|C1| |restbase2009|D1|
2016-04-19 16:56:33 <addshore> *goes to look at those 2 groups*
2016-04-19 16:56:39 <mutante> normally maint scripts are run by cron, not people
2016-04-19 16:56:42 <bblack> I'm getting 503s on gerrit
2016-04-19 16:56:44 <mutante> they are on terbium
2016-04-19 16:56:47 <elukey> bblack: me too
2016-04-19 16:56:49 <_joe_> mutante: no!
2016-04-19 16:56:53 <_joe_> they are on wasat
2016-04-19 16:56:55 <bblack> nevermind, fixed itself
2016-04-19 16:56:55 <_joe_> :)
2016-04-19 16:57:01 <mutante> oops, heh, of course :)
2016-04-19 16:57:10 <ostriches> I logged it ;-)
2016-04-19 16:57:56 <bblack> grrrit-wm is not reporting in either
2016-04-19 16:58:07 <ostriches> It always dies after a gerrit kick.
2016-04-19 16:58:22 <addshore> Krenair: in that case maybe I will put a request in for restricted
2016-04-19 16:58:29 <ostriches> Krenair: Holp? grrrit-wm ^
2016-04-19 16:58:30 <bblack> mutante: ok to merge?
2016-04-19 16:58:51 <_joe_> addshore: request the ability to run mw maintenance scripts
2016-04-19 16:58:53 <mutante> bblack: yes please, the reason i didn't is that i got 503 from gerrit during puppet-merge because of the restart
2016-04-19 16:58:53 <mark> btw, we're enqueueing job queue jobs faster than processing them atm
2016-04-19 16:58:55 <Krenair> ostriches, can't help due to https://phabricator.wikimedia.org/T132828
2016-04-19 16:58:56 <_joe_> it's a better request
2016-04-19 16:59:11 <mark> queue size is increasing
2016-04-19 16:59:14 <_joe_> mark: I suspect that has to do with some loop like the last time
2016-04-19 16:59:14 <ostriches> YuviPanda: Plz ^
2016-04-19 16:59:24 <Krenair> unless you want me to elevate my tools access to projectadmin :)
2016-04-19 17:00:21 <Krenair> but ops might not like that so much
2016-04-19 17:00:24 <_joe_> ori, AaronSchulz the queue size is unsurprisingly increasing; can either of you take a look? I'd say AaronSchulz given ori is awake since... I lost count
2016-04-19 17:01:41 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218697 (IKhitron) Well, I made the list of missing edits, and they are not unmarked!
2016-04-19 17:01:43 <ostriches> !log ytterbium: stopped puppet for a bit, testing host key mess.
2016-04-19 17:01:47 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218698 (IKhitron) Well, I made the list of missing edits, and they...
2016-04-19 17:01:48 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 17:01:50 <wikibugs> Operations, Ops-Access-Requests: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2218699 (Addshore)
2016-04-19 17:02:05 <addshore> _joe_: Krenair ^^
2016-04-19 17:02:53 <gehel> I had a reindex job in progress during the switchover. It now looks like those jobs are stuck in a job queue.
2016-04-19 17:03:00 <_joe_> addshore: it won't happen now btw
2016-04-19 17:03:11 <addshore> yeh I know :) X days or something ;)
2016-04-19 17:03:29 <_joe_> heh I feared I was crushing your expectations :P
2016-04-19 17:03:54 <addshore> it's not my first rodeo ;)
2016-04-19 17:03:55 <wikibugs> Operations, Traffic, HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#2218726 (BBlack)
2016-04-19 17:03:57 <wikibugs> Operations, Traffic, Patch-For-Review: HSTS preload for wmfusercontent.org - https://phabricator.wikimedia.org/T132452#2218723 (BBlack) Open>Resolved a:BBlack The redirect above worked, this is submitted to the preload list now (will take the usual lag time to make it into browsers)
2016-04-19 17:04:11 <volans> and this is yet another behaviour of the jobqueue: it went from 0 to 1.6M in 1 minute when started, and then kept going up
2016-04-19 17:04:16 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218728 (Smalyshev) Yes, WDQS uses recent changes API, so if that one is broken then updates are broken too.
2016-04-19 17:04:40 <mark> seems like ori and AaronSchulz went missing
2016-04-19 17:05:03 <paravoid> maybe Krinkle?
2016-04-19 17:05:32 <mark> I'm guessing that team hasn't renamed into "availability" just yet ;)
2016-04-19 17:06:33 <ostriches> mark: My 100% uptime is gonna suffer this month :(
2016-04-19 17:06:46 <paravoid> haha
2016-04-19 17:07:32 <paravoid> urandom: the SSTables alert is tripping again
2016-04-19 17:08:21 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218736 (Deskana)
2016-04-19 17:08:23 <mutante> stat1004 actually has Petabyte storage, wow
2016-04-19 17:08:23 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218737 (Deskana)
2016-04-19 17:08:39 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218323 (Deskana)
2016-04-19 17:08:40 <paravoid> ostriches: speaking of your uptime :P -- I saw that mutante is moving gitblit to a new host; why aren't we just killing it?
2016-04-19 17:08:41 <YuviPanda> ostriches: I can do it yeah.
2016-04-19 17:08:41 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218112 (Deskana)
2016-04-19 17:08:42 <volans> !log Deleting pc1002* old binlog from pc2005 to make some space
2016-04-19 17:08:45 <mutante> i'm not sure i saw a "P" in df -h before
2016-04-19 17:08:46 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 17:08:56 <wikibugs> Operations, MediaWiki-Recent-changes, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218323 (Deskana)
2016-04-19 17:08:58 <paravoid> volans: pc2006 is warning too
2016-04-19 17:08:58 <wikibugs> Operations, Discovery, Wikidata, Wikidata-Query-Service, codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218112 (Deskana)
2016-04-19 17:09:10 <volans> paravoid: yes will be next ;) thanks
2016-04-19 17:09:46 <ostriches> paravoid: Phabricator doesn't index non-committed revisions yet (things in gerrit but not yet merged). That's still used for repo browsing from Gerrit.
2016-04-19 17:09:54 <jynus> pc1002, shouldn't it be pc1005?
2016-04-19 17:10:04 <wikibugs> Operations, Analytics-Cluster, Analytics-Kanban, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2218749 (Ottomata) a:Ottomata
2016-04-19 17:10:05 <ostriches> We're working on a patch, but haven't figured it out 100% yet.
2016-04-19 17:10:14 <volans> jynus: there are both
2016-04-19 17:10:18 <jynus> ha
2016-04-19 17:10:19 <volans> those are from Jan
2016-04-19 17:10:26 <jynus> yes, when hardware upgrade
2016-04-19 17:11:10 <wikibugs> Operations, Analytics-Cluster, Analytics-Kanban, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2218757 (Ottomata) Will work on this today/tomorrow.
2016-04-19 17:11:40 <volans> jynus: I'm deleting pc1005* too, other 166GB
2016-04-19 17:12:00 <jynus> +1
2016-04-19 17:12:18 <volans> !log Deleting pc1005* binlog from pc2005 to make some space
2016-04-19 17:12:22 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 17:12:24 <jynus> I also moved sqldata-cache.bak to /home
2016-04-19 17:12:46 <jynus> that is the old, local, parsed articles
2016-04-19 17:12:59 <jynus> only 4-5 GB
2016-04-19 17:13:00 <volans> ok thanks, that was just 4.5GB
2016-04-19 17:13:11 <volans> 78% now
2016-04-19 17:13:25 <wikibugs> Operations, MediaWiki-Recent-changes, Availability, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218798 (mark)
2016-04-19 17:15:55 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (matmarex)
2016-04-19 17:16:22 <mutante> meh, jenkins-bot says Verified +2 but when you want to merge "needs Verified" ...lies
2016-04-19 17:17:27 <volans> !log Deleting pc1003* and pc1006* binlog from pc2006 to make some space
2016-04-19 17:17:32 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 17:20:42 <wikibugs> Operations, MediaWiki-Recent-changes, Availability, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2016-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218891 (mobrovac)
2016-04-19 17:21:09 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218892 (Yann) I purge https://fr.wikisource.org/wiki/MediaWiki:Sidebar and it looks OK now.
2016-04-19 17:21:19 <mobrovac> mutante: in those instances, refreshing the page helps
2016-04-19 17:22:26 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218905 (matmarex) I still see the bad sidebar viewing https://fr.wikisource.org/wiki/Page:Tolstoï_-_Le_salut_est_en_vous.djvu/55 when not logged in. Curiously,...
2016-04-19 17:22:50 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (bd808) I'm seeing similar issues on https://wikimediafoundation.org/wiki/Home. The sidebar was the default version for both anon and authed. A page purg...
2016-04-19 17:23:22 <ottomata> mutante: /srv on stat1004 is
2016-04-19 17:23:29 <mutante> 6.8T , just saw
2016-04-19 17:23:30 <ottomata> 6.8T avail
2016-04-19 17:23:33 <ottomata> ja
2016-04-19 17:23:34 <ottomata> :)
2016-04-19 17:23:43 <mutante> :) ok, cool, then the setup is done
2016-04-19 17:23:52 <mutante> to copy from 1001 to 1004 that is
2016-04-19 17:24:04 <ottomata> also, for rsync, i see you made a new ::migration class?
2016-04-19 17:24:34 <ottomata> we might want to just add stat1004 to list of statistics servers, and include statistics::rsync class
2016-04-19 17:24:34 <ottomata> all statistics_servers are configured to be able to write to each other's /srv
2016-04-19 17:24:35 <ottomata> but, whatev!
2016-04-19 17:24:35 <ottomata> temp ::migration class works too
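ottomata's point is that the statistics servers are already configured to rsync into each other's `/srv`; the temporary `::migration` class would be exercised with something like the following. A sketch only (hostname, module name and destination path are assumptions, and the command is printed rather than run):

```shell
# Hypothetical push of stat1001's /srv into an rsyncd module on stat1004
# (-a archive mode, -n dry run so a first pass only lists what would move).
src=/srv/
dest='stat1004.eqiad.wmnet::srv/stat1001/'
echo "rsync -an ${src} ${dest}"
```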
2016-04-19 17:25:30 <logmsgbot> !log demon@tin Synchronized php-1.27.0-wmf.21/extensions/CentralAuth: forgot something (duration: 00m 42s)
2016-04-19 17:25:35 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 17:25:58 <mutante> ok, then
2016-04-19 17:25:59 <mutante> elukey: ^
2016-04-19 17:27:50 <grrrit-wm> (Abandoned) Dzahn: stats: adjust rsyncd pathes to use petabyte mount [puppet] - https://gerrit.wikimedia.org/r/284235 (https://phabricator.wikimedia.org/T76348) (owner: Dzahn)
2016-04-19 17:28:07 <grrrit-wm> (CR) Dzahn: "nevermind, /srv _is_ large enough, so this is done" [puppet] - https://gerrit.wikimedia.org/r/284235 (https://phabricator.wikimedia.org/T76348) (owner: Dzahn)
2016-04-19 17:28:23 <wikibugs> Operations, Analytics-Cluster, Analytics-Kanban, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2218956 (Ottomata)
2016-04-19 17:33:58 <wikibugs> Operations, Ops-Access-Requests: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2218998 (Krenair) So restricted access, basically?
2016-04-19 17:34:39 <elukey> thanks mutante, didn't know about the rsync sorry :(
2016-04-19 17:34:41 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2219003 (bd808) >>! In T133069#2218908, @bd808 wrote: > Now I have a strange reproduction case for anons: > * Hit https://wikimediafoundation.org/wiki/Home and s...
2016-04-19 17:35:10 <grrrit-wm> (PS2) BBlack: Common VCL: remove wikimedia.org subdomain HTTPS redirect exception [puppet] - https://gerrit.wikimedia.org/r/284106 (https://phabricator.wikimedia.org/T102826)
2016-04-19 17:36:11 <AaronSchulz> enqueue: 1141607 queued; 2250154 claimed (304734 active, 1945420 abandoned); 0 delayed [enwiki]
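AaronSchulz's paste is the aggregate queue state for enwiki: jobs queued, jobs claimed (split into active and abandoned), and jobs delayed. If the raw counters are needed, they can be pulled apart with a little awk; a sketch assuming exactly the line format shown:

```shell
# Aggregate queue line as pasted above; extract the counters by field position.
line='enqueue: 1141607 queued; 2250154 claimed (304734 active, 1945420 abandoned); 0 delayed [enwiki]'
queued=$(echo "$line" | awk '{print $2}')
claimed=$(echo "$line" | awk '{print $4}')
abandoned=$(echo "$line" | awk '{gsub(/[^0-9]/, "", $8); print $8}')
echo "queued=${queued} claimed=${claimed} abandoned=${abandoned}"
```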
2016-04-19 17:36:25 <grrrit-wm> (PS4) Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324)
2016-04-19 17:36:27 <_joe_> so it's mostly enwiki?
2016-04-19 17:36:29 <Krinkle> paravoid: pong
2016-04-19 17:36:40 <AaronSchulz> next wiki is only 6k
2016-04-19 17:38:13 <AaronSchulz> _joe_: is there a generic maintenance host name (e.g. not terbium)
2016-04-19 17:38:20 <_joe_> nope
2016-04-19 17:38:30 <_joe_> wasat is the new terbium
2016-04-19 17:38:52 <grrrit-wm> (CR) Elukey: [C: 2] Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
2016-04-19 17:39:54 <_joe_> ostriches: I was thinking, you should use mira for scap now
2016-04-19 17:41:21 <ostriches> I thought about that the second after I sync'd.
2016-04-19 17:41:26 <ostriches> Muscle memory
2016-04-19 17:41:38 <_joe_> yeah it's not properly switched over, though
2016-04-19 17:41:52 <_joe_> I can switchover if we think it's needed, but I don't think so
2016-04-19 17:41:59 <_joe_> since we froze changes this week
2016-04-19 17:44:54 <_joe_> so outstanding problems are: 1) Missing RC changes 2) Some wikidata articles failing to render (addshore has a solution, see https://phabricator.wikimedia.org/T133048, I just don't feel confident running that script) 3) Sidebar not properly updated 4) 1 M enwiki jobs
2016-04-19 17:46:13 <Krenair> do we have tickets for each?
2016-04-19 17:46:17 <Krenair> I know there's one for RC
2016-04-19 17:46:29 <Krenair> I know addshore has an access request for maint scripts
2016-04-19 17:46:52 <Krenair> aha, sidebar was filed: https://phabricator.wikimedia.org/T133069
2016-04-19 17:47:09 <wikibugs> Operations, Analytics-Kanban, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2219074 (Dzahn) We now have an rsyncd running on stat1004, ready to accept data from stat1001, it will be in /srv/stat1001/ , there are 3 modules, one for home, one for srv and...
2016-04-19 17:47:17 <Krenair> not sure about 1m jobs
2016-04-19 17:49:08 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219084 (Dzahn) >>! In T128968#2216663, @Peachey88 wrote: > That is offtopic for this task, Please file a seperate task This has been opened before as T93523 . It wa...
2016-04-19 17:49:52 <wikibugs> Operations, MediaWiki-Recent-changes, Availability, Security-General, codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2016-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2219089 (matmarex) a:matmarex I'm going to wo...
2016-04-19 17:49:57 <jynus> !log setting binlog_format=ROW on old x1-master at eqiad (db1029) to reenable replication
2016-04-19 17:50:01 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 17:50:37 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219092 (Sjoerddebruin) Note: the domain seems to be registered already. But all I get is MacKeeper shit...
2016-04-19 17:50:51 <grrrit-wm> (PS1) Volans: MariaDB: Fix pt-heartbeat for x1 codfw master [puppet] - https://gerrit.wikimedia.org/r/284243 (https://phabricator.wikimedia.org/T124699)
2016-04-19 17:51:02 <_joe_> addshore: still around?
2016-04-19 17:51:06 <addshore> yup
2016-04-19 17:51:26 <_joe_> addshore: that script should run just on wikidata, right?
2016-04-19 17:51:41 <_joe_> so --wiki wikidatawiki if I'm not wrong
2016-04-19 17:52:02 <addshore> yup
2016-04-19 17:52:03 <grrrit-wm> (PS1) Rush: kubernetes to v1.2.2wmf1 [puppet] - https://gerrit.wikimedia.org/r/284244
2016-04-19 17:52:05 <addshore> and that should be it
2016-04-19 17:52:39 <_joe_> addshore: I'm going to run that then :)
2016-04-19 17:52:40 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219098 (Dzahn) Yep, looks like it's already gone. , registered to a Mr. Gerbert in London.
2016-04-19 17:52:51 <addshore> _joe_: awesome, and I'll be here throughout :)
2016-04-19 17:52:51 <grrrit-wm> (CR) Yuvipanda: [C: ] kubernetes to v1.2.2wmf1 [puppet] - https://gerrit.wikimedia.org/r/284244 (owner: Rush)
2016-04-19 17:53:17 <icinga-wm> RECOVERY - MySQL Slave Running on db1029 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error
2016-04-19 17:53:36 <jynus> ^seems to work, and recovering from lag quickly
2016-04-19 17:53:42 <volans> cool
2016-04-19 17:53:44 <wikibugs> Operations, Phabricator, Traffic, hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219099 (mmodell) One interesting thing that phabricator seems to be implementing it's own load balancing for repositories. I think the idea is that any fr...
2016-04-19 17:56:56 <grrrit-wm> (CR) Rush: [C: 2 V: 2] kubernetes to v1.2.2wmf1 [puppet] - https://gerrit.wikimedia.org/r/284244 (owner: Rush)
2016-04-19 17:57:43 <wikibugs> Operations, WMF-Legal, Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2219108 (Krinkle) p:Triage>Normal
2016-04-19 17:57:58 <icinga-wm> PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 17:58:34 <wikibugs> Operations, Phabricator, Traffic, hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219113 (Dzahn) @Robh @Papaul do we have a server in codfw that matches "4-8 core CPU, 16-32G ram and 500g of non-mirrored storage." and could be used for t...
2016-04-19 17:58:48 <icinga-wm> PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 17:59:04 <grrrit-wm> (PS2) Volans: MariaDB: Fix pt-heartbeat for x1 codfw master [puppet] - https://gerrit.wikimedia.org/r/284243 (https://phabricator.wikimedia.org/T124699)
2016-04-19 18:00:04 <wikibugs> Operations, Phabricator, Traffic, hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219126 (mmodell)
2016-04-19 18:00:08 <icinga-wm> RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
2016-04-19 18:00:23 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2219132 (bd808) Purging https://wikimediafoundation.org/w/index.php?title=Questions_for_Wikimedia%3F&redirect=no fixed that one instance. This is easily explaine...
2016-04-19 18:00:26 <_joe_> !log running rebuildEntityPerPage.php on wikidata, T133048
2016-04-19 18:00:27 <stashbot> T133048: BadMethodCallException when viewing Items created around the time of the eqiad -> codfw switch - https://phabricator.wikimedia.org/T133048
2016-04-19 18:00:31 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 18:00:33 <grrrit-wm> (CR) Volans: [C: 2] "changes looks good https://puppet-compiler.wmflabs.org/2503/"; [puppet] - https://gerrit.wikimedia.org/r/284243 (https://phabricator.wikimedia.org/T124699) (owner: Volans)
2016-04-19 18:01:08 <icinga-wm> RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
2016-04-19 18:02:47 <elukey> aqs is still suffering from the Cassandra timeouts, new hardware will arrive soon
2016-04-19 18:05:28 <icinga-wm> RECOVERY - MariaDB Slave Lag: x1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.07 seconds
2016-04-19 18:05:29 <grrrit-wm> (PS1) Ottomata: Add analytics1003 in netboot.cfg and site.pp [puppet] - https://gerrit.wikimedia.org/r/284249 (https://phabricator.wikimedia.org/T130840)
2016-04-19 18:05:45 <icinga-wm> RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.08 seconds
2016-04-19 18:05:47 <yurik> i cannot login into horizon.wikimedia.org - is that part of the switchover?
2016-04-19 18:06:13 <wikibugs> Operations, Wikidata, codfw-rollout: BadMethodCallException when viewing Items created around the time of the eqiad -> codfw switch - https://phabricator.wikimedia.org/T133048#2219157 (Addshore) Open>Resolved a:Addshore Looks resolved to me!
2016-04-19 18:06:40 <akosiaris> yurik: no it is not
2016-04-19 18:06:56 <icinga-wm> RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
2016-04-19 18:07:08 <icinga-wm> RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
2016-04-19 18:07:17 <icinga-wm> PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 18:07:34 <paravoid> ottomata: "Throughput of EventLogging NavigationTiming events" ?
2016-04-19 18:07:41 <addshore> Krenair: that wikidata one is ticked off the list :)
2016-04-19 18:07:44 <ottomata> looking
2016-04-19 18:07:48 <paravoid> urandom: SSTables alert?
2016-04-19 18:07:49 <yurik> akosiaris, i logged in about an hour ago, now i tried it again and it fails
2016-04-19 18:07:57 <icinga-wm> PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-04-19 18:08:20 <ottomata> btw, paravoid looking, the proper folks to ping about that are performance folks
2016-04-19 18:08:24 <grrrit-wm> (PS1) Yuvipanda: tools: Add a null check for manifest version checking [puppet] - https://gerrit.wikimedia.org/r/284250
2016-04-19 18:08:30 <ottomata> that comes from the statsv instance running on hafnium
2016-04-19 18:08:52 <Krinkle> ottomata: Is there an example in prod that uses get_simple_consumer in the way you described yesterday? I know it's a trivial parameter, but I can't test statsv easily, so it would be nice to have a reference to something and not be the only one using it that way
2016-04-19 18:08:58 <grrrit-wm> (PS2) Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729)
2016-04-19 18:09:17 <icinga-wm> RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
2016-04-19 18:09:29 <icinga-wm> ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1029 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check
2016-04-19 18:09:35 <wikibugs> Operations, Phabricator, Traffic, hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219177 (RobH) a:mark No need to ping Papaul, he doesn't have any involvement in #hardware-requests. (Its primarily myself, and if I am out sick, then...
2016-04-19 18:09:37 <ottomata> hm, not that I know of Krinkle, the only other prod use of pykafka I know of is the eventlogging handler. it uses the pykafka balancedConsumer though
2016-04-19 18:09:39 <ottomata> not the simple consumer
2016-04-19 18:09:40 <ottomata> but
2016-04-19 18:09:42 <grrrit-wm> (PS2) Yuvipanda: tools: Add a null check for manifest version checking [puppet] - https://gerrit.wikimedia.org/r/284250
2016-04-19 18:09:43 <ottomata> the configs should be the same
2016-04-19 18:09:44 <grrrit-wm> (PS3) Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729)
2016-04-19 18:09:46 <ottomata> mostly
2016-04-19 18:09:49 <grrrit-wm> (CR) Yuvipanda: [C: 2 V: 2] tools: Add a null check for manifest version checking [puppet] - https://gerrit.wikimedia.org/r/284250 (owner: Yuvipanda)
2016-04-19 18:09:58 <Krinkle> addshore: https://phabricator.wikimedia.org/T133000
2016-04-19 18:09:58 <icinga-wm> RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
2016-04-19 18:10:02 <wikibugs> Operations, Phabricator, Traffic, hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2219181 (RobH)
2016-04-19 18:10:07 <ottomata> but, the example uses the eventlogging URI handler scheme for config and passes them via kwargs
2016-04-19 18:10:11 <Krinkle> addshore: I filed that yesterday before the switchover started
2016-04-19 18:10:15 <grrrit-wm> (PS4) Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729)
2016-04-19 18:10:15 <ottomata> so i can't really point you to a code example directly
2016-04-19 18:10:33 <addshore> Krinkle: ooh, at a first glance that looks unrelated
2016-04-19 18:10:41 <ottomata> Krinkle: i strangely see this in service statsv status output
2016-04-19 18:10:42 <ottomata> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 40: invalid start byte
2016-04-19 18:10:47 <Krinkle> addshore: Well, it makes RC unusable, so it should be fixed too :)
2016-04-19 18:10:54 <ottomata> para void pinged about the icinga alert showing unknown, that's why i'm looking
2016-04-19 18:10:57 <addshore> ahh okay!
2016-04-19 18:11:03 <Krinkle> it's a fatal
2016-04-19 18:11:19 <addshore> got to dig into a possible solution for another wikidata thing first as fallout from the switch
2016-04-19 18:11:24 <wikibugs> Operations, Phabricator, Traffic, hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2219189 (Papaul) @Dzahn WMF5849 rbf2001 A5 Dell PowerEdge R420 Intel® Xeon® Processor E5-2440 3.00 6 cores Yes 32 GB RAM (2) 500GB SATA WMF3641 B5 Dell Powe...
2016-04-19 18:11:57 <addshore> Do we have a timestamp for when jobs stopped being able to be queued, and a timestamp for when the queues started working again?
2016-04-19 18:14:36 <grrrit-wm> (PS5) Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729)
2016-04-19 18:14:40 <grrrit-wm> (CR) Yuvipanda: [C: -1] "I'm pretty sure we shouldn't be using puppet's auth.conf - that'll diverge us from prod's puppetmaster a lot more than necessary on import" [puppet] - https://gerrit.wikimedia.org/r/284103 (owner: Andrew Bogott)
2016-04-19 18:16:06 <ottomata> Krinkle: are you planning on just adding auto_offset_reset=-1, or are you going to tell it to commit offsets too?
2016-04-19 18:16:18 <Krinkle> probably auto_offset_reset=-1
2016-04-19 18:16:23 <ottomata> i think that makes sense for statsv
2016-04-19 18:16:25 <ottomata> cool
2016-04-19 18:16:36 <ottomata> I'm pretty sure if you just add that to the get_simple_consumer call
2016-04-19 18:16:44 <ottomata> it will just work ™
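The auto_offset_reset change being discussed could look roughly like this. This is a sketch, not the actual statsv code: it assumes statsv builds its consumer via pykafka's `topic.get_simple_consumer()`, and the `-1` in the conversation is pykafka's `OffsetType.LATEST`, which is defined as -1.

```python
# Sketch of the discussed statsv change: make the pykafka simple consumer
# start at the newest message instead of replaying backlog after a restart.
OFFSET_LATEST = -1  # stand-in for pykafka.common.OffsetType.LATEST

def simple_consumer_kwargs():
    """Keyword arguments that would be passed to topic.get_simple_consumer().

    In pykafka, auto_offset_reset only takes effect when there is no
    committed offset to resume from (or when reset_offset_on_start is also
    set) -- which is why the question above about committing offsets matters.
    """
    return {
        'auto_offset_reset': OFFSET_LATEST,
    }
```

With a real pykafka client this would be used as `topic.get_simple_consumer(**simple_consumer_kwargs())`.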
2016-04-19 18:17:11 <grrrit-wm> (CR) Andrew Bogott: "Auth.conf is nonetheless present on the puppetmaster. With it or without it, puppet rejects queries to that url unless it is explicitly p" [puppet] - https://gerrit.wikimedia.org/r/284103 (owner: Andrew Bogott)
2016-04-19 18:17:33 <grrrit-wm> (CR) Dzahn: [C: 2] kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 18:18:28 <ottomata> heh, Krinkle not sure what is wrong with statsv on hafnium right now
2016-04-19 18:18:28 <grrrit-wm> (CR) Yuvipanda: "The current auth.conf is a strange beast - if you look at auth.conf.orig, that was what was there before, and probably was far more permis" [puppet] - https://gerrit.wikimedia.org/r/284103 (owner: Andrew Bogott)
2016-04-19 18:18:34 <ottomata> i would restart it, but... :p
2016-04-19 18:18:42 <Krinkle> I'm not aware of there being an issue
2016-04-19 18:18:51 <Krinkle> I noticed you got pinged earlier by someone with the name of that task
2016-04-19 18:18:55 <Krinkle> what is this about?
2016-04-19 18:18:58 <Krinkle> I couldn't find it in backscroll
2016-04-19 18:18:59 <addshore> *scrolls up* there is nothing in the SAL about when jobchron came back :/
2016-04-19 18:19:05 <grrrit-wm> (PS4) Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729)
2016-04-19 18:19:24 <ottomata> Krinkle: about https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1001&service=Throughput+of+EventLogging+NavigationTiming+events
2016-04-19 18:19:31 <grrrit-wm> (CR) Dzahn: [C: 2] kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 18:19:36 <ottomata>   UNKNOWN   (for 0d 20h 19m 19s)
2016-04-19 18:19:37 <Krinkle> ottomata: Yes, please don't restart. I'd rather have it shutdown indefinitely in that case. All my graphs are beyond useless because of the previous incident.
2016-04-19 18:19:44 <Krinkle> Gonna have to figure a way to wipe that in Graphite
2016-04-19 18:20:05 <ottomata> Krinkle: hm, yeah. if you submit to graphite directly you can set the timestamp
2016-04-19 18:20:18 <ottomata> maybe you can do that and use the timestamp of the event in statsv? meh, maybe you don't have that
2016-04-19 18:20:27 <Krinkle> statsv has no concept of timestamps
2016-04-19 18:20:28 <ottomata> Krinkle: para void just pinged me about that icinga alert
2016-04-19 18:20:37 <Krinkle> it buffers all incoming packets and flushes an aggregate once per minute to graphite
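Ottomata's point about submitting to graphite directly: graphite's plaintext protocol (normally TCP port 2003) accepts an explicit epoch timestamp per datapoint, which the statsd/statsv path does not. A minimal sketch, with an illustrative metric name and host:

```python
import time

def graphite_line(metric, value, timestamp=None):
    """Format one datapoint for graphite's plaintext protocol: a
    newline-terminated '<metric.path> <value> <epoch_timestamp>' record."""
    if timestamp is None:
        timestamp = int(time.time())
    return '%s %s %d\n' % (metric, value, timestamp)

# Illustrative use (hypothetical metric name and host):
# sock = socket.create_connection(('graphite1001.eqiad.wmnet', 2003))
# sock.sendall(graphite_line('webperf.example.latency', 123, event_ts).encode())
```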
2016-04-19 18:20:37 <ottomata> right ja, it's supposed to be statsd :9
2016-04-19 18:20:38 <ottomata> :p
2016-04-19 18:20:48 <Krinkle> ottomata: The old one?
2016-04-19 18:20:53 <ottomata> happening now
2016-04-19 18:20:55 <ottomata> https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1001&service=Throughput+of+EventLogging+NavigationTiming+events
2016-04-19 18:20:56 <Krinkle> link?
2016-04-19 18:21:05 <ottomata> ^^
2016-04-19 18:21:07 <ottomata> and on hafnium:
2016-04-19 18:21:19 <ottomata> sudo service statsv status
2016-04-19 18:21:19 <ottomata> ...
2016-04-19 18:21:24 <ottomata> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 40: invalid start byte
2016-04-19 18:21:27 <Krinkle> doesn't have sudo there
2016-04-19 18:21:30 <addshore> https://www.wikidata.org/wiki/Wikidata_talk:Main_Page#Sidebar It would appear the link to the mainpage on wikidata now links to the wrong place, could this really be codfw fallout...
2016-04-19 18:21:43 <ottomata> wha?!
2016-04-19 18:21:44 <ottomata> heh
2016-04-19 18:22:05 <ottomata> dunno what data it's loading
2016-04-19 18:22:07 <ottomata> but it's stuck :/
2016-04-19 18:22:17 <ottomata> data = json.loads(raw_data) is throwing a decode exception
2016-04-19 18:22:44 <Krinkle> ottomata: Yeah, it's quite possible I guess. When I was tailing statsv from kafka the other day I also got a half packet of incomplete json
2016-04-19 18:22:45 <ottomata> Krinkle: should we just work now to make statsv use auto_offset_reset=-1, deploy that change, and then restart?
2016-04-19 18:22:59 <Krinkle> It's been known to happen a few times. I saw that Ori's been patching various of our consumers to catch parse errors
2016-04-19 18:23:01 <Krinkle> but it's annoying.
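The defensive parsing being described for these consumers might look like this minimal sketch (`parse_message` is an illustrative name, not the actual statsv code): skip undecodable messages instead of letting one of them kill the process.

```python
import json

def parse_message(raw_data):
    """Decode one raw kafka message, returning None instead of crashing on
    garbage. A single invalid byte (e.g. the 0xa0 in the traceback above)
    or a truncated half-packet of JSON would otherwise raise and take the
    whole consumer down."""
    try:
        return json.loads(raw_data.decode('utf-8'))
    except (UnicodeDecodeError, ValueError):
        return None
```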
2016-04-19 18:23:19 <Krinkle> ottomata: I guess, yeah, that sounds good. But I don't see an issue yet though.
2016-04-19 18:23:26 <Krinkle> Data seems to be coming in from where I'm looking
2016-04-19 18:23:39 <ottomata> oh
2016-04-19 18:23:39 <ottomata> hm
2016-04-19 18:23:41 <ottomata> ja?
2016-04-19 18:23:48 <ottomata> ok maybe its not dead then
2016-04-19 18:23:50 <Krinkle> both eventlogging and statsv
2016-04-19 18:23:55 <Krinkle> ottomata: remember navtiming is not statsv
2016-04-19 18:24:07 <Krinkle> eventlogging ->navtiming
2016-04-19 18:24:11 <ottomata> HMMMM
2016-04-19 18:24:13 <Krinkle> statsv is well, statsv
2016-04-19 18:24:22 <Krinkle> not related to navtiming or eventlogging
2016-04-19 18:24:25 <ottomata> yeah ok looking at the alert, yeah, it's based on the kafka topic not statsd stuff
2016-04-19 18:24:26 <ottomata> sorry
2016-04-19 18:24:33 <Krinkle> shatters reality
2016-04-19 18:24:44 <ottomata> HAHAH
2016-04-19 18:24:45 <ottomata> DUH
2016-04-19 18:24:46 <ottomata> whoops
2016-04-19 18:24:53 <ottomata> getting my wires crossed here
2016-04-19 18:25:17 <ottomata> hm, ok, the data for navtiming is fine in el
2016-04-19 18:25:17 <Krinkle> https://grafana.wikimedia.org/dashboard/db/performance-metrics?from=now-1h (eventlogging navtiming -> statsd -> graphite)
2016-04-19 18:25:18 <icinga-wm> RECOVERY - Restbase root url on restbase1014 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.043 second response time
2016-04-19 18:25:20 <ottomata> and kafka
2016-04-19 18:25:28 <Krinkle> https://grafana.wikimedia.org/dashboard/db/media?from=now-1h (statsv -> statsd)
2016-04-19 18:25:29 <icinga-wm> RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
2016-04-19 18:25:31 <Krinkle> both seem fine
2016-04-19 18:25:43 <ottomata> ok looks like alert is faulty then, will look into it, sorry for dragging you into that
2016-04-19 18:25:56 <Krinkle> no worries
2016-04-19 18:26:14 <Krinkle> btw, scap sudo, I can't even ssh into hafnium apparently.
2016-04-19 18:26:16 <Krinkle> scrap*
2016-04-19 18:26:34 <Krinkle> not that I need to
2016-04-19 18:27:02 <ottomata> ha, i think you should be able to!
2016-04-19 18:27:08 <addshore> ahh Krenair was there a ticket for that sidebar thing?
2016-04-19 18:27:13 <ottomata> who's maintaining statsv then :)
2016-04-19 18:27:20 <Krenair> addshore: https://phabricator.wikimedia.org/T133069
2016-04-19 18:27:25 <mutante> !log kraz.codfw, reinstalling as kraz.wikimedia
2016-04-19 18:27:29 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 18:28:04 <Krinkle> ottomata: performance team yes. They're both in puppet/files/webperf/
2016-04-19 18:29:08 <icinga-wm> PROBLEM - Host kraz is DOWN: PING CRITICAL - Packet loss = 100%
2016-04-19 18:29:13 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (Addshore) Vaguely similar, the link to the main page for wikidata.org now links to the incorrect place. See https://www.wikidata.org/wiki/Wikidata_talk...
2016-04-19 18:33:21 <Dereckson> AaronSchulz: hi, so if you run the MatmaRex script, you need to do it on wasat.codfw.wmnet, not on terbium.
2016-04-19 18:35:03 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (Legoktm) I think at some point MessageCache failed and started returning the defaults for everything, which ended up getting cached by other things (sid...
2016-04-19 18:36:57 <Krinkle> mutante: Hm.. I didn't realise until now that kraz is in codfw
2016-04-19 18:38:29 <mutante> isn't that a good thing?
2016-04-19 18:39:56 <grrrit-wm> (PS1) Eevans: raise highest SSTables (again) [puppet] - https://gerrit.wikimedia.org/r/284256
2016-04-19 18:40:11 <urandom> paravoid: ^^
2016-04-19 18:41:03 <grrrit-wm> (PS2) Faidon Liambotis: raise highest SSTables (again) [puppet] - https://gerrit.wikimedia.org/r/284256 (owner: Eevans)
2016-04-19 18:41:23 <urandom> paravoid: it would be great if there was something in between nothing and annoying everyone
2016-04-19 18:41:25 <grrrit-wm> (PS3) Faidon Liambotis: Raise highest SSTables thresholds (again) [puppet] - https://gerrit.wikimedia.org/r/284256 (owner: Eevans)
2016-04-19 18:41:40 <grrrit-wm> (CR) Faidon Liambotis: [C: 2 V: 2] Raise highest SSTables thresholds (again) [puppet] - https://gerrit.wikimedia.org/r/284256 (owner: Eevans)
2016-04-19 18:42:38 <urandom> paravoid: the answer seems to be, "routinely put eyeballs on a bunch of graphs", which is... disappointing
2016-04-19 18:45:07 <yurik> greg-g, is it ok to make labs-only config changes this week?
2016-04-19 18:45:19 <yurik> and by labs i mean betacluster :)
2016-04-19 18:45:48 <YuviPanda> pats yurik
2016-04-19 18:45:49 <grrrit-wm> (PS1) Ottomata: Fix eventlogging_NavigationTiming_throughput again - need sumSeries() [puppet] - https://gerrit.wikimedia.org/r/284257
2016-04-19 18:46:03 <YuviPanda> yurik: am pretty sure greg-g is on leave now.
2016-04-19 18:46:22 <yurik> YuviPanda, ok, do you know if there are any restrictions on beta cluster changes?
2016-04-19 18:46:50 <YuviPanda> yurik: I do not know, sorry. ask in #wikimedia-releng maybe
2016-04-19 18:51:12 <grrrit-wm> (CR) Ottomata: [C: 2] Fix eventlogging_NavigationTiming_throughput again - need sumSeries() [puppet] - https://gerrit.wikimedia.org/r/284257 (owner: Ottomata)
2016-04-19 18:51:23 <wikibugs> Operations: Something in WMF infrastructure corrupts responses with certain lengths - https://phabricator.wikimedia.org/T132159#2219457 (Anomie)
2016-04-19 18:51:28 <grrrit-wm> (PS2) Ottomata: Add analytics1003 in netboot.cfg and site.pp [puppet] - https://gerrit.wikimedia.org/r/284249 (https://phabricator.wikimedia.org/T130840)
2016-04-19 18:51:36 <grrrit-wm> (CR) Ottomata: [C: 2 V: 2] Add analytics1003 in netboot.cfg and site.pp [puppet] - https://gerrit.wikimedia.org/r/284249 (https://phabricator.wikimedia.org/T130840) (owner: Ottomata)
2016-04-19 18:51:56 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (Joe) So, during the switchover we first wiped the codfw memcached clean, then when moving the traffic over we had a temporary overload of the externalst...
2016-04-19 18:58:00 <grrrit-wm> (PS1) Dzahn: install: update MAC address of kraz [puppet] - https://gerrit.wikimedia.org/r/284259 (https://phabricator.wikimedia.org/T123729)
2016-04-19 18:58:59 <grrrit-wm> (PS1) Eevans: Disable RESTBase highest max SSTables per read threshold [puppet] - https://gerrit.wikimedia.org/r/284262
2016-04-19 18:59:15 <grrrit-wm> (CR) Dzahn: [C: 2] install: update MAC address of kraz [puppet] - https://gerrit.wikimedia.org/r/284259 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 19:03:12 <wikibugs> Operations, EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2219493 (RobH)
2016-04-19 19:03:14 <wikibugs> Operations, Analytics-Cluster, EventBus, Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2219495 (RobH)
2016-04-19 19:03:17 <wikibugs> Operations, EventBus, hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2219491 (RobH) stalled>Resolved As this task has had systems allocated, and setup is via T131959, resolving this request.
2016-04-19 19:03:55 <wikibugs> Operations, hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2219500 (RobH)
2016-04-19 19:04:59 <wikibugs> Operations, Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2219505 (RobH)
2016-04-19 19:05:02 <wikibugs> Operations, hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2219503 (RobH) stalled>Resolved This request has been fulfilled via order on #procurement task T127508. Resolving this #hardware-requests task.
2016-04-19 19:05:47 <wikibugs> Operations, RESTBase, hardware-requests, Patch-For-Review: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2219514 (RobH)
2016-04-19 19:05:50 <wikibugs> Operations, RESTBase, hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2219512 (RobH) Open>Resolved This was resolved awhile ago, and this task was overlooked (as the sub-tasks had the actual work performed on them.)
2016-04-19 19:06:39 <legoktm> !log purging sidebar cache across all wikis (T133069)
2016-04-19 19:06:40 <stashbot> T133069: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069
2016-04-19 19:06:40 <wikibugs> Blocked-on-Operations, Operations, RESTBase, hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2219524 (RobH)
2016-04-19 19:06:42 <wikibugs> Operations, RESTBase, hardware-requests, Patch-For-Review: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2219522 (RobH) Open>Resolved These systems are being replaced via the sub-tasks. Since the hardware request is granted, I'm...
2016-04-19 19:06:43 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 19:06:46 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219528 (jayvdb) I can't see T93523. Is it private? Can it be made public, or should we create a new task about that problem?
2016-04-19 19:07:20 <grrrit-wm> (PS1) Dzahn: add tegmen and einsteinium to site.pp [puppet] - https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023)
2016-04-19 19:10:53 <wikibugs> Operations, Traffic, WMF-Legal, domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219572 (Dzahn) Yes, it has a custom policy. It has been created by @Glaisher. I modified the custom policies by adding your user manually. Try again now?
2016-04-19 19:12:11 <AaronSchulz> !log Cleared enwiki 'enqueue' queue (T133089)
2016-04-19 19:12:12 <stashbot> T133089: enwiki "enqueue" queue showed corruption - https://phabricator.wikimedia.org/T133089
2016-04-19 19:12:15 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 19:14:40 <wikibugs> Operations, EventBus, Services, hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#2219593 (RobH) irc update: In triaging the #hw-requests, I've checked with @ottomata. This needs to have further investigation done, so I'm keeping it assig...
2016-04-19 19:25:04 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2219659 (Legoktm) The script completed in about 4 minutes. Now we need a varnish purge for every page cached after the switchover till my script finished.
2016-04-19 19:26:29 <wikibugs> Operations, RESTBase-Cassandra, Services: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2219667 (Eevans)
2016-04-19 19:27:14 <grrrit-wm> (PS2) Eevans: Disable RESTBase highest max SSTables per read threshold [puppet] - https://gerrit.wikimedia.org/r/284262 (https://phabricator.wikimedia.org/T133091)
2016-04-19 19:29:58 <wikibugs> Operations, RESTBase-Cassandra, Services, Patch-For-Review: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2219709 (Eevans)
2016-04-19 19:30:29 <wikibugs> Operations, Patch-For-Review, developer-notice, notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2219712 (Dzahn) reinstalled with public IP as kraz.wikimedia.org, in puppet and up and running.
2016-04-19 19:35:14 <grrrit-wm> (CR) Andrew Bogott: "auth.conf.orig ends with:" [puppet] - https://gerrit.wikimedia.org/r/284103 (owner: Andrew Bogott)
2016-04-19 19:36:09 <grrrit-wm> (CR) Dzahn: [C: -1] add tegmen and einsteinium to site.pp [puppet] - https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023) (owner: Dzahn)
2016-04-19 19:38:26 <icinga-wm> PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 3 failures
2016-04-19 19:39:38 <grrrit-wm> (PS1) Jcrespo: Revert "Depool one db server from each shard as a backup" [mediawiki-config] - https://gerrit.wikimedia.org/r/284271
2016-04-19 19:39:46 <grrrit-wm> (PS2) Jcrespo: Revert "Depool one db server from each shard as a backup" [mediawiki-config] - https://gerrit.wikimedia.org/r/284271
2016-04-19 19:39:50 <grrrit-wm> (PS1) Ottomata: Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840)
2016-04-19 19:40:59 <grrrit-wm> (PS1) Dzahn: ircserver/irc_echo: use systemd provider if on jessie [puppet] - https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729)
2016-04-19 19:41:06 <grrrit-wm> (CR) jenkins-bot: [V: -1] Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840) (owner: Ottomata)
2016-04-19 19:41:40 <grrrit-wm> (CR) Jcrespo: [C: 2] Revert "Depool one db server from each shard as a backup" [mediawiki-config] - https://gerrit.wikimedia.org/r/284271 (owner: Jcrespo)
2016-04-19 19:42:17 <grrrit-wm> (PS2) Ottomata: Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840)
2016-04-19 19:42:28 <grrrit-wm> (PS2) Dzahn: ircserver/irc_echo: use systemd provider if on jessie [puppet] - https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729)
2016-04-19 19:42:46 <grrrit-wm> (CR) Dzahn: [C: 2] ircserver/irc_echo: use systemd provider if on jessie [puppet] - https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 19:43:36 <logmsgbot> !log jynus@tin Synchronized wmf-config/db-eqiad.php: Revert "Depool one db server from each shard as a backup" (duration: 00m 27s)
2016-04-19 19:43:40 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 19:44:15 <grrrit-wm> (CR) Ottomata: [C: 2] Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840) (owner: Ottomata)
2016-04-19 19:44:24 <wikibugs> Operations: Investigate idle appservers in codfw - https://phabricator.wikimedia.org/T133093#2219746 (Southparkfan)
2016-04-19 19:45:30 <grrrit-wm> (PS3) Dzahn: ircserver/irc_echo: use systemd provider if on jessie [puppet] - https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729)
2016-04-19 19:45:35 <grrrit-wm> (PS2) Dzahn: add tegmen and einsteinium to site.pp [puppet] - https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023)
2016-04-19 19:48:59 <grrrit-wm> (CR) Dzahn: "noop on argon - now needs unit files on kraz" [puppet] - https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 19:49:19 <grrrit-wm> (PS3) Dzahn: add tegmen and einsteinium to site.pp [puppet] - https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023)
2016-04-19 19:49:50 <grrrit-wm> (CR) Dzahn: [C: 2] "just start by adding them with base::firewall" [puppet] - https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023) (owner: Dzahn)
2016-04-19 19:51:10 <icinga-wm> ACKNOWLEDGEMENT - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 5 failures daniel_zahn T123729
2016-04-19 19:51:28 <logmsgbot> !log ori@tin Synchronized php-1.27.0-wmf.21/maintenance/rebuildrecentchanges.php: Ie9799f5ea: rebuildrecentchanges: Allow rebuilding specified time range only (duration: 00m 28s)
2016-04-19 19:51:32 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 19:52:55 <wikibugs> Operations: Investigate idle appservers in codfw - https://phabricator.wikimedia.org/T133093#2219824 (mark) p:Triage>Lowest
2016-04-19 19:53:14 <paravoid> !log staggered varnish bans for 'obj.http.server ~ "^mw2.+"' as a workaround for T133069
2016-04-19 19:53:14 <stashbot> T133069: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069
2016-04-19 19:53:18 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 19:55:46 <icinga-wm> PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures
2016-04-19 19:57:21 <grrrit-wm> (PS1) Ottomata: analytics1015 -> analytics1003 migration [puppet] - https://gerrit.wikimedia.org/r/284276 (https://phabricator.wikimedia.org/T130840)
2016-04-19 19:57:44 <wikibugs> Operations, Icinga, Patch-For-Review: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2219847 (Dzahn) >>! In T125023#2198454, @akosiaris wrote: > @dzahn. We 've already got replacement boxes. > But don't just reuse einsteinium to replace neon, the idea is to manage to kil...
2016-04-19 19:57:50 <grrrit-wm> (CR) Ottomata: [C: -1] "To be merged during migration" [puppet] - https://gerrit.wikimedia.org/r/284276 (https://phabricator.wikimedia.org/T130840) (owner: Ottomata)
2016-04-19 19:58:05 <grrrit-wm> (PS1) Dzahn: icinga: put role on einsteinium for testing [puppet] - https://gerrit.wikimedia.org/r/284277 (https://phabricator.wikimedia.org/T125023)
2016-04-19 19:58:43 <nuria_> bblack: question if you guys are not in the midst of switchover craziness
2016-04-19 19:59:20 <wikibugs> Operations, Phabricator, Traffic, hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2219859 (RobH) @papaul: Daniel shouldn't have pinged you for this, as I handle the #hardware-requests, you can disregard. Thanks!
2016-04-19 20:00:57 <wikibugs> Operations, Analytics-Cluster, Analytics-Kanban, hardware-requests, and 2 others: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2219864 (Ottomata) Ok, we are ready to proceed. Plan here: https://etherpad.wikimedia.org/p/analytics-meta 1. stop camus early...
2016-04-19 20:01:54 <grrrit-wm> (PS2) Andrew Bogott: Updates to nova policy.json: [puppet] - https://gerrit.wikimedia.org/r/284233 (https://phabricator.wikimedia.org/T132187)
2016-04-19 20:05:43 <bblack> nuria_: ?
2016-04-19 20:05:51 <nuria_> bblack: yessir, question:
2016-04-19 20:06:07 <nuria_> bblack: http headers when read by varnish are case sensitive correct?
2016-04-19 20:06:10 <grrrit-wm> (Abandoned) Thcipriani: Bump portals [mediawiki-config] - https://gerrit.wikimedia.org/r/280456 (https://phabricator.wikimedia.org/T130514) (owner: Thcipriani)
2016-04-19 20:06:26 <bblack> bblack: depends...
2016-04-19 20:06:40 <grrrit-wm> (PS2) Mattflaschen: Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871)
2016-04-19 20:06:51 <nuria_> bblack: aha, depends on ..?
2016-04-19 20:07:06 <bblack> well a lot of things
2016-04-19 20:07:22 <bblack> the values are going to be case-sensitive, unless we do a case-insensitive regex match
2016-04-19 20:07:24 <matt_flaschen> jynus, I know you're probably really busy with the datacenter switchover, but https://gerrit.wikimedia.org/r/#/c/282440/ could use review whenever you are able to get to it. Let me know if I can clarify anything.
2016-04-19 20:07:40 <bblack> I don't think the keys are sensitive (req.http.host and req.http.Host should have the same meaning)
2016-04-19 20:07:51 <jynus> matt_flaschen, while that is great, worse timing possible
2016-04-19 20:07:56 <bblack> but there are probably other ways you could mean that question too
2016-04-19 20:08:09 <nuria_> bblack: right, so this by default
2016-04-19 20:08:10 <nuria_> https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/analytics.inc.vcl.erb#L162
2016-04-19 20:08:27 <matt_flaschen> Yeah, doesn't have to be today, or even this week. Just wanted to reach out now that you're back.
2016-04-19 20:08:44 <bblack> nuria_: the line you linked, the case of 'X-Analytics' doesn't matter
2016-04-19 18:08:44 <nuria_> bblack: you think it's not case sensitive?
2016-04-19 20:09:13 <ori> !log ran `mwscript rebuildrecentchanges.php --wiki=testwiki --from=20160419144741 --to=20160419151018`
2016-04-19 20:09:17 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 20:09:33 <bblack> nuria_: some days I'm not even sure whether I'm breathing real air, nothing is 100% :)
2016-04-19 20:09:52 <bblack> nuria_: but I'm pretty sure in the header name matches for req.http.FoO, case does not matter
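The distinction bblack draws matches HTTP generally: header field *names* are case-insensitive (RFC 7230), while *values* compare byte-for-byte unless you deliberately match case-insensitively. A small illustration of what a name-insensitive lookup looks like:

```python
def get_header(headers, name):
    """Case-insensitive header *name* lookup; the *value* is returned as-is."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value
    return None

# X-Analytics vs x-analytics: the same header as far as lookup goes,
# but a value of 'Pageview' would still not equal 'pageview'.
```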
2016-04-19 20:10:07 <paravoid> bblack: see above -- I'm banning caches for pages served by mw2xxx
2016-04-19 20:10:17 <nuria_> bblack: ok, req and resp, right?
2016-04-19 20:10:43 <paravoid> backend codfw is done, there was a small spike on appserver traffic that remained for a while
2016-04-19 20:10:54 <paravoid> it's dropped a bit now, still a little more elevated than normal
2016-04-19 20:11:05 <paravoid> I just banned it from ulsfo backends slowly
2016-04-19 20:11:12 <bblack> ban?
2016-04-19 20:11:14 <paravoid> yeah
2016-04-19 20:11:19 <wikibugs> Operations, OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2219937 (Legoktm)
2016-04-19 20:11:28 <bblack> yeah ok
2016-04-19 20:11:41 <bblack> sorry, I'm still catching up. if what you're doing is working, keep at it :)
2016-04-19 20:11:46 <paravoid> obj.http.server ~ "^mw2.+"
2016-04-19 20:11:47 <ori> bblack: sidebar HTML was bad, had to ban 'obj.http.server ~ "^mw2.*"'
2016-04-19 20:12:41 <paravoid> (T133069 is the task)
2016-04-19 20:12:41 <stashbot> T133069: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069
2016-04-19 20:13:12 <bblack> yeah I've seen the task
2016-04-19 20:13:33 <bblack> please don't close it after the purge though, because IMHO the code's behavior is wrong regardless of any fix we do here
2016-04-19 20:13:51 <_joe_> bblack: I already said that too :)
2016-04-19 20:14:30 <bblack> nuria_: yes, req and resp are the same.
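[Editor's note: the distinction bblack is drawing — header names match case-insensitively, header values do not — can be illustrated outside VCL. A minimal Python sketch, not Varnish itself; the header name and value are taken from the discussion above:]

```python
# Illustration only (not Varnish): HTTP header NAMES are case-insensitive
# per RFC 7230, so req.http.host and req.http.Host mean the same header,
# but header VALUES keep their case unless matched with a (?i) regex.
import re

def get_header(headers, name):
    """Case-insensitive header lookup, like req.http.FoO in VCL."""
    wanted = name.lower()
    for k, v in headers.items():
        if k.lower() == wanted:
            return v
    return None

headers = {"X-Analytics": "WMF-Last-Access=19-Apr-2016"}

# Name lookup ignores case:
assert get_header(headers, "x-analytics") == get_header(headers, "X-ANALYTICS")

# Value matching is case-sensitive unless the regex opts out:
assert not re.search("wmf-last-access", headers["X-Analytics"])
assert re.search("(?i)wmf-last-access", headers["X-Analytics"])
```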
2016-04-19 20:14:58 <paravoid> I wouldn't :)
2016-04-19 20:17:11 <grrrit-wm> (PS1) Dzahn: ircserver: add systemd unit file and conditionals [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729)
2016-04-19 20:20:21 <bblack> of course https://wikitech.wikimedia.org/wiki/Varnish#How_to_execute_a_ban_across_a_cluster is eqiad-specific heh
2016-04-19 20:21:22 <paravoid> heh
2016-04-19 20:21:26 <paravoid> I'm not following that exactly anyway
2016-04-19 20:21:36 <paravoid> I'm not doing "not codfw", I'm going site per site to be on the safe side
2016-04-19 20:21:49 <bblack> ok
2016-04-19 20:21:54 <grrrit-wm> (PS2) Dzahn: ircserver: add systemd unit file and conditionals [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729)
2016-04-19 20:22:07 <bblack> well the 3x commands there are still the "right thing", just s/eqiad/codfw/
2016-04-19 20:22:09 <paravoid> it doesn't look like anything we can't handle atm, but it's still a considerable amount of extra traffic
2016-04-19 20:22:20 <bblack> you can break them into DC sub-steps, but not change the ordering
2016-04-19 20:22:22 <bblack> (much)
2016-04-19 20:22:33 <paravoid> I know :)
2016-04-19 20:23:12 <bblack> for bans that aren't super-time-critical, spacing out BE from FE can really reduce the impact too
2016-04-19 20:23:25 <paravoid> that's what i'm doing
2016-04-19 20:23:42 <paravoid> spacing out codfw be from the rest of the bes, and then fes
2016-04-19 20:24:13 <paravoid> so that the other sites and frontends can absorb a little of this extra load
2016-04-19 20:24:28 <bblack> yeah
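[Editor's note: a hypothetical sketch of the site-by-site sequencing paravoid describes — codfw backends first, then the other sites' backends one at a time, frontends last. The grain names, batch size, and the `-n frontend` instance name are assumptions for illustration, not the wiki page's literal commands; only the ban expression and the `-C 'G@…'` matcher style come from the log:]

```shell
# Hypothetical sketch only -- not the exact production runbook.
BAN='ban obj.http.server ~ "^mw2.+"'

# 1. codfw text-cache backends first (the caches that talked to mw2xxx directly)
salt -b 1 -C 'G@cluster:cache_text and G@site:codfw' cmd.run "varnishadm $BAN"
# 2. then the other sites' backends, one site at a time (repeat per site)
salt -b 1 -C 'G@cluster:cache_text and G@site:ulsfo' cmd.run "varnishadm $BAN"
# 3. frontends last, so they keep absorbing hits while the backends refill
salt -b 1 -C 'G@cluster:cache_text' cmd.run "varnishadm -n frontend $BAN"
```

The ordering matters for the reason discussed above: banning backends before frontends lets the still-populated frontend caches shield the application servers while backend cache contents are refilled.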
2016-04-19 20:24:30 <paravoid> this isn't super-time-critical, just annoying I assume
2016-04-19 20:24:55 <paravoid> http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+codfw&m=cpu_report&s=by+name&mc=2&g=network_report
2016-04-19 20:25:14 <bblack> that's not so bad
2016-04-19 20:25:15 <paravoid> it's 10-15%
2016-04-19 20:25:40 <grrrit-wm> (PS3) Dzahn: ircserver: add systemd unit file and conditionals [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729)
2016-04-19 20:25:49 <grrrit-wm> (PS3) Volans: MariaDB: complete TLS and master configuration [puppet] - https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654)
2016-04-19 20:27:19 <grrrit-wm> (CR) Dzahn: "noop on argon http://puppet-compiler.wmflabs.org/2505/" [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 20:27:29 <grrrit-wm> (CR) Dzahn: [C: 2] ircserver: add systemd unit file and conditionals [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 20:28:31 <grrrit-wm> (CR) Andrew Bogott: [C: 2] Updates to nova policy.json: [puppet] - https://gerrit.wikimedia.org/r/284233 (https://phabricator.wikimedia.org/T132187) (owner: Andrew Bogott)
2016-04-19 20:29:11 <mutante> so reliable that somebody merges while you are waiting for jenkins
2016-04-19 20:29:18 <grrrit-wm> (PS4) Dzahn: ircserver: add systemd unit file and conditionals [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729)
2016-04-19 20:29:38 <grrrit-wm> (CR) Dzahn: [V: 2] ircserver: add systemd unit file and conditionals [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 20:31:09 <paravoid> traffic is entirely back to regular levels now
2016-04-19 20:31:16 <paravoid> bans on ulsfo/eqiad backends didn't even make a dent
2016-04-19 20:31:18 <grrrit-wm> (CR) Dzahn: "and next issue is now: Could not find dependency File[/etc/init/ircd.conf] :p" [puppet] - https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 20:32:30 <paravoid> bblack: did you have a way to ban specific time ranges?
2016-04-19 20:32:35 <wikibugs> Operations, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (BBlack) >>! In T133069#2219464, @Joe wrote: > So, during the switchover we first wiped the codfw memcached clean, then when moving the traffic over we h...
2016-04-19 20:32:40 <paravoid> I remember you saying that you had tried something before but wasn't sure if it worked or something
2016-04-19 20:32:44 <icinga-wm> PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail
2016-04-19 20:33:13 <bblack> paravoid: yeah, I'm not sure if it works
2016-04-19 20:33:28 <bblack> I think I used obj.http.Date?
2016-04-19 20:33:39 <ori> there's a backend-timing header too, it contains a timestamp from the backend
2016-04-19 20:33:54 <bblack> I mean it should work, but in the scenario I once tried it, I wasn't sure of the result
2016-04-19 20:34:14 <bblack> and yeah there's now that:
2016-04-19 20:34:15 <bblack> < Backend-Timing: D=53212 t=1461087447104093
2016-04-19 20:34:28 <bblack> where t is epoch time generated on the MW side I believe
2016-04-19 20:34:38 <ori> yes
2016-04-19 20:34:41 <paravoid> cool
2016-04-19 20:34:49 <paravoid> so yeah, next time let's try that instead
2016-04-19 20:34:58 <paravoid> far less objects to ban :)
2016-04-19 20:35:34 <grrrit-wm> (PS1) Dzahn: ircserver: fix dependencies for running on jessie [puppet] - https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729)
2016-04-19 20:35:42 <bblack> well, if we're confident about the window during which the ES/mc issue could cause large scale defaulting problems
2016-04-19 20:35:51 <bblack> then yeah we could've been more accurate there
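[Editor's note: the Backend-Timing idea can be made concrete. Per the exchange above, the `t=` field looks like epoch time in microseconds generated on the MediaWiki side, and ori's window was given as MediaWiki-style UTC timestamps (20160419144741 to 20160419151018). A hedged sketch of the conversion only — the actual ban expression is not shown, since as far as I know Varnish bans support only string/regex operators, not numeric comparison:]

```python
# Sketch: Backend-Timing's t= field appears to be epoch microseconds
# (t=1461087447104093 is ~2016-04-19 17:37:27 UTC). Converting the
# incident window from MediaWiki timestamps allows a range comparison.
from datetime import datetime, timezone

def mw_ts_to_epoch_us(ts):
    """Convert a MediaWiki timestamp (YYYYMMDDHHMMSS, UTC) to epoch microseconds."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1_000_000

start = mw_ts_to_epoch_us("20160419144741")
end = mw_ts_to_epoch_us("20160419151018")
t = 1461087447104093  # the t= value pasted above

print(start <= t <= end)  # → False: that response was generated after the window
```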
2016-04-19 20:35:57 <grrrit-wm> (CR) jenkins-bot: [V: -1] ircserver: fix dependencies for running on jessie [puppet] - https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 20:35:59 <grrrit-wm> (PS2) Dzahn: ircserver: fix dependencies for running on jessie [puppet] - https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729)
2016-04-19 20:39:06 <wikibugs> Operations: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#2220109 (Dzahn) since these are dnsrecursors (in addition to urldownloader), what steps have to be taken before one of them can be taken down for reinstall? any?
2016-04-19 20:40:10 <grrrit-wm> (CR) Dzahn: [C: 2] "noop on argon http://puppet-compiler.wmflabs.org/2506/" [puppet] - https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 20:44:02 <ori> !log on all wikis, deleting from recentchanges where rc_timestamp > 20160419144741 and rc_timestamp < 20160419151018
2016-04-19 20:44:06 <morebots> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
2016-04-19 20:44:30 <grrrit-wm> (CR) Dzahn: "service would be running now if it wasn't for the next problem: python-irclib doesnt exist here" [puppet] - https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) (owner: Dzahn)
2016-04-19 20:44:38 <wikibugs> Operations, MediaWiki-Cache, Wikimedia-General-or-Unknown, codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2220141 (mark)
2016-04-19 20:46:33 <paravoid> fucking salt
2016-04-19 20:46:36 <wikibugs> Operations, DBA, Patch-For-Review: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#2220150 (Volans) When rolling restart also check the error log, if too big let's rotate it and compress/delete the old one bas...
2016-04-19 20:47:11 <paravoid> wtf
2016-04-19 20:47:17 <paravoid> cp1008.wikimedia.org:
2016-04-19 20:47:17 <paravoid> pc1006.eqiad.wmnet: Minion did not return. [No response]
2016-04-19 20:47:17 <paravoid> wtp2007.codfw.wmnet: Minion did not return. [No response]
2016-04-19 20:47:24 <wikibugs> Operations, Patch-For-Review, developer-notice, notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2220156 (Dzahn) The IRCd service could be starting on jessie now, the unit file is there, the dependencies are adjusted if on jessie, but the next proble...
2016-04-19 20:47:25 <paravoid> these were not in my set
2016-04-19 20:47:46 <paravoid> it first printed my set, then a bunch of "no response" for completely unrelated hosts
2016-04-19 20:48:33 <mutante> Krinkle: fyi, no "python-irclib" on jessie is the next blocker
2016-04-19 20:48:44 <icinga-wm> RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
2016-04-19 20:49:10 <_joe_> paravoid: that's the beauty of salt
2016-04-19 20:49:28 <_joe_> you did use -v, right?
2016-04-19 20:49:32 <paravoid> it's so insanely broken all the time in ways one cannot even imagine
2016-04-19 20:50:10 <_joe_> no it's just telling you
2016-04-19 20:50:17 <_joe_> that those hosts are not responding
2016-04-19 20:50:17 <paravoid> telling me what?
2016-04-19 20:50:26 <_joe_> so it can't select their grains :P
2016-04-19 20:50:26 <paravoid> but I didn't pick those hosts
2016-04-19 20:50:34 <paravoid> well
2016-04-19 20:50:38 <paravoid> it was hundreds of hosts
2016-04-19 20:50:38 <_joe_> you did salt -G 'something' right?
2016-04-19 20:50:40 <paravoid> so how very useful
2016-04-19 20:50:50 <bblack> yeah it's semi-normal, but it doesn't always happen either
2016-04-19 20:50:51 <paravoid> -C G@, yes
2016-04-19 20:51:00 <bblack> the only way to avoid it for sure is to use batch-mode
2016-04-19 20:51:10 <wikibugs> Operations, developer-notice, notice: build python-irclib for jessie - https://phabricator.wikimedia.org/T133101#2220169 (Dzahn)
2016-04-19 20:51:12 <_joe_> and since salt doesn't seem to cache that data...
2016-04-19 20:51:13 <bblack> usually now when I don't watch it batched, I do "-b 10000" or whatever
2016-04-19 20:51:28 <_joe_> bblack: rotfl
2016-04-19 20:51:29 <wikibugs> Operations, Ops-Access-Requests: Requesting access to hive for AGomez (WMF) - https://phabricator.wikimedia.org/T133102#2220183 (atgo)