[07:24:15] Labs, Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1922554 (scfc) Upstream has recommended just that (and declined the bug :-)). Another idea I have been thinking about in the past: Move (all) the admin web stuff to `nginx`. There is a [[https://github....
[11:21:31] MediaWiki-extensions-OpenStackManager: Pop-up notification when deleting an instance contains literal "$2" - https://phabricator.wikimedia.org/T123162#1922632 (scfc) NEW
[12:49:41] Nemo_bis: looking at grrrit-wm
[12:52:35] !log grrrit-wm two instances running, k8s is only aware of one of them
[12:52:35] grrrit-wm is not a valid project.
[12:52:41] !log tools.lolrrit-wm two instances running, k8s is only aware of one of them
[12:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[12:54:27] !log tools.lolrrit-wm running on tools-worker-1005.tools.eqiad.wmflabs according to k8s (kubectl --user=lolrrit-wm --namespace=lolrrit-wm describe pod grrrit-sm8km)
[12:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[12:58:47] !log tools.lolrrit-wm I don't see it running anywhere else either. Odd. Let's kill the single one I see...
[12:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[12:59:23] Thanks a lot
[12:59:26] hr,.
[12:59:34] Then where is grrrit-wm running?! :/
[13:01:33] insisting
[13:01:33] GRRR.
[13:01:48] maybe in the original tool labs project?
[13:01:59] !log tools.lolrrit-wm grrrit-wm1 is handled by k8s, and thus restarts nicely. I can't figure out where the other one is running, though...
[13:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[13:02:10] not on sge, as far as I can see...
[13:03:08] valhallasw`cloud: ls -ltrh /data/project/lolrrit-wm shows something is going on though
[13:03:50] or is that just someone following links
[13:04:12] I only see access.log and error.log as recently changed? and those are just the webservice
[13:04:18] ok
[13:04:35] unless you mean something else?
[13:12:41] !log tools tools-worker-1002. is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.
[13:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[13:20:38] Nemo_bis: I'm at a loss, to be honest.
[13:27:13] Nemo_bis: grrrit-wm is also reporting, right?
[13:29:50] Labs, Tool-Labs, grrrit-wm: Rogue grrrit-wm running... somewhere - https://phabricator.wikimedia.org/T123167#1922698 (valhallasw) NEW
[13:38:21] valhallasw`cloud: yes, both are
[13:38:32] ok, let me just see if I can kill the k8s one then
[13:40:13] baaah.
[13:42:37] whoo!
[13:43:51] Labs, Tool-Labs, grrrit-wm: Rogue grrrit-wm running... somewhere - https://phabricator.wikimedia.org/T123167#1922710 (valhallasw) I have stopped the k8s one with ``` kubectl --user=lolrrit-wm --namespace=lolrrit-wm stop replicationcontroller grrrit ``` the rogue one is still up and running.
[19:56:11] Labs, Tool-Labs: missing database on replica server - https://phabricator.wikimedia.org/T105713#1922852 (Superyetkin) Could you please do it?
[20:24:26] (CR) Nemo bis: "I guess you should abandon this and make a new changeset if you ever move/copy the new code. :)" [labs/tools/bub] - https://gerrit.wikimedia.org/r/129709 (owner: 8ohit.dua)
[20:29:40] Labs, Tool-Labs: tools-checker-01 down - https://phabricator.wikimedia.org/T123173#1922879 (valhallasw) NEW
[20:39:14] valhallasw`cloud: I am here finally… is it too late for me to be of any use to you?
[20:40:38] andrewbogott: for tools-checker I have to prod a bit in the logs, but there doesn't seem to be anything really wrong luckily
[20:40:49] andrewbogott: you might be able to have an idea on how to figure out on which server grrrit-wm is running
[20:40:58] because I can't find it :-D
[20:41:03] hm, ok
[20:41:28] (there were two of them running this morning -- one in k8s as expected and this one which is running... somewhere)
[20:42:19] I think netstat on the NAT server should be able to give us a hint?
[20:42:49] by ‘the NAT server’ do you mean the tools proxy or the labs-wide nat?
[20:43:18] 208.80.155.255
[20:43:48] (it's an outgoing connection to freenode)
[20:45:33] netstat | grep -i freenode is finding nothing so far
[20:47:11] maybe under 193.219.128.49 / sendak.freenode.net? That's the server it's connected to
[20:49:05] Labs, Tool-Labs: tools-checker-01 down - https://phabricator.wikimedia.org/T123173#1922906 (valhallasw) Apparently tools-checker.wmflabs.org was already on tools-checker-02. On there, the nginx error log reports: ``` 2016/01/09 20:29:38 [error] 1065#0: *1646319 connect() to unix:/run/toolschecker/toolsch...
[20:51:09] btw, are we hunting down grrrit-wm because it’s related to the alerts, or is this unrelated?
[20:51:55] no, it's unrelated
[20:53:36] Is T123173 the whole story about the alerts that fired an hour ago?
[20:54:10] yes
[20:54:52] grepping netstat isn’t finding me anything. I probably have a login on sendak if you think that would be good for anything.
(I suppose that will just show me the nat IP though)
[20:55:38] Labs, Tool-Labs: tools-checker-02: connect() to unix:/run/toolschecker/toolschecker.sock failed (11: Resource temporarily unavailable) - https://phabricator.wikimedia.org/T123173#1922922 (valhallasw)
[21:00:21] Labs, Tool-Labs: tools-checker-02: connect() to unix:/run/toolschecker/toolschecker.sock failed (11: Resource temporarily unavailable) - https://phabricator.wikimedia.org/T123173#1922925 (valhallasw) ``` 21:54 PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRI...
[21:00:37] valhallasw`cloud: should we try a fail-over of tools-checker to -01 so we can reboot -02? (I don’t know how to do the failover but surely it’s straightforward)
[21:00:45] andrewbogott: -01 is completely dead to begin with
[21:00:52] oh :(
[21:00:57] Well, should I reboot /that/ one then?
[21:01:01] but has been for a while if I can believe graphite, so rebooting sounds good
[21:02:06] !log tools rebooting tools-checker-01 as it is unresponsive.
[21:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[21:03:33] andrewbogott: I don't get what's happening, but the gist is the host has resource issues every now and then, this stays for an hour or so and it's fine again
[21:04:24] and it happens every few weeks
[21:04:28] The virt host looks fine, at least...
[21:04:36] Labs, Tool-Labs: role::labs::tools::proxy tries to create /etc/kubernetes/kubeconfig without requiring /etc/kubernetes - https://phabricator.wikimedia.org/T123176#1922926 (scfc) NEW
[21:05:20] ok, checker-01 is back alive
[21:05:50] yeah, it looks ok to me
[21:06:16] do you know how to move the service over?
[21:06:30] yeah
[21:06:33] is it just a dns change?
[21:06:35] Labs, Tool-Labs: Replace reference to liblua5.1-json with lua-json - https://phabricator.wikimedia.org/T123177#1922935 (scfc) NEW
[21:06:39] it's an IP change
[21:07:03] ah - ok, surely you have the rights to that, want to do the honors?
[21:07:09] sure!
[21:07:25] !log tools moved tools-checker/208.80.155.229 back to tools-checker-01
[21:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:08:28] Labs, Tool-Labs: tools-checker-02: connect() to unix:/run/toolschecker/toolschecker.sock failed (11: Resource temporarily unavailable) - https://phabricator.wikimedia.org/T123173#1922944 (valhallasw) We rebooted checker-01, and moved the floating IP there. -01 is now serving the catchpoint checks: ``` 20...
[21:08:53] hm. So checker-01 is using ~500M memory while -02 has 1.8GB in use
[21:09:11] so must be a leak someplace
[21:10:03] well, maybe. We can reboot -02 and see if it lines up with -01
[21:10:11] there's nothing that's obviously using a lot of memory, though :/ top looks pretty comparable
[21:10:50] When you say it’s ‘using’ 1.8GB, you mean according to ‘free’?
[21:11:26] yeah, but that of course also takes buffers into account
[21:11:54] ignoring those it's 1.8GB free for -01 and 1.4GB for -02 which is less crazy
[21:12:24] ~ same number of processes on both
[21:12:25] ok, I think my brain is just too fried to actually debug this sanely
[21:12:48] ok. Hopefully the failover will keep things happy for a while at least.
[21:13:02] Sorry I’m no help about grrrit
[21:13:03] thanks for your help
[21:13:09] heh.
[21:13:21] well, I killed the k8s one so at least there's only one working now
[21:15:47] Labs, Tool-Labs, grrrit-wm: Rogue grrrit-wm running...
somewhere - https://phabricator.wikimedia.org/T123167#1922945 (Andrew)
[21:18:46] Labs, Tool-Labs: tools-checker-01 denies access with ssh - https://phabricator.wikimedia.org/T122470#1922947 (scfc) Open>Resolved a:scfc The reboot seems to have fixed it.
[21:18:55] Labs, Tool-Labs: tools-checker-01 denies access with ssh - https://phabricator.wikimedia.org/T122470#1922950 (scfc) a:scfc>valhallasw
[21:37:43] (PS1) Legoktm: Don't install dev dependencies via composer [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263187
[21:38:07] (CR) Legoktm: [C: 2] Don't install dev dependencies via composer [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263187 (owner: Legoktm)
[21:38:38] (Merged) jenkins-bot: Don't install dev dependencies via composer [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263187 (owner: Legoktm)
[21:46:04] (PS1) Legoktm: Allow specifying repos via --repo= [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263188
[21:46:22] (CR) Legoktm: [C: 2] Allow specifying repos via --repo= [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263188 (owner: Legoktm)
[21:46:50] (Merged) jenkins-bot: Allow specifying repos via --repo= [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263188 (owner: Legoktm)
[21:50:37] Labs, Tool-Labs: missing database on replica server - https://phabricator.wikimedia.org/T105713#1922957 (jcrespo) I think someone beat me to do it?
[21:55:32] (PS1) Legoktm: Pass -x to git clean to get rid of everything [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263190
[21:56:03] (CR) Legoktm: [C: 2] Pass -x to git clean to get rid of everything [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263190 (owner: Legoktm)
[21:56:37] (Merged) jenkins-bot: Pass -x to git clean to get rid of everything [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263190 (owner: Legoktm)
[22:07:35] (PS1) Legoktm: Allow ^C'ing to work [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263193
[22:07:54] (CR) Legoktm: [C: 2] Allow ^C'ing to work [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263193 (owner: Legoktm)
[22:08:23] (Merged) jenkins-bot: Allow ^C'ing to work [labs/tools/extdist] - https://gerrit.wikimedia.org/r/263193 (owner: Legoktm)
[22:21:10] valhallasw`cloud, I only get error messages from the API when I used the revids parameter.
[22:21:19] ?
[22:21:30] Can you give me an example of a query fetching specific revisions of content?
[22:22:12] take your query, remove all generator stuff, put &revids=1 in, done?
[22:22:20] I did.
[22:22:23] I think.
[22:22:33] I'm not good with the generator stuff.
[22:23:19] Ah there we go.
[22:23:22] Got it.
[22:23:33] That feature is very strict. :
[22:23:34] :p
[22:57:39] Labs-project-extdist: extdist is not deleting some tarballs for master - https://phabricator.wikimedia.org/T123180#1922990 (Legoktm) NEW