[00:03:59] 10Tool-Labs-tools-Other: tmg/articlemedia tool not working - https://phabricator.wikimedia.org/T89695#1099414 (10hoo)
[05:02:04] can someone restart quick_intersections ?
[07:26:10] 10Tool-Labs, 7Tracking: Missing Toolserver features in Tools (tracking) - https://phabricator.wikimedia.org/T60791#1099552 (10jeremyb)
[07:26:12] 10Tool-Labs: Provide namespace IDs and names in the databases similar to toolserver.namespace - https://phabricator.wikimedia.org/T50625#1099550 (10jeremyb) 5Resolved>3Open MariaDB [s51892_toolserverdb_p]> select * from namespacename; ERROR 144 (HY000): Table './s51892_toolserverdb_p/namespacename' is marked...
[07:26:12] 10Tool-Labs-tools-Other: Migrate http://toolserver.org/~dispenser/* to Tool Labs - https://phabricator.wikimedia.org/T68868#1099554 (10jeremyb)
[07:26:13] 10Wikimedia-Labs-Other, 6operations, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1099553 (10jeremyb)
[08:30:22] GerardM-: quick intersection seems to work for me?
[08:30:23] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Erwin's-tools, 7Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1099591 (10Akoopal) All tools are tools that do some larger queries, but as far as I can see, only order of couple of seconds, not more. I have not gone ove...
[08:36:12] 10Tool-Labs: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099596 (10scfc) Regarding the first case (`persondata`), there is no entry in the proxy list on the proxies. I had compared that list on `tools-webproxy` just a few days ago, but didn't think about `tools-webproxy-01`...
[08:37:05] 10Tool-Labs: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099597 (10yuvipanda) Aaargh, looks like this is my fault for deleting tools-webproxy without patching portgrabber and associated things....
[08:48:19] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099621 (10yuvipanda) Am attempting to run puppet on all the affected nodes via salt now.
[08:50:59] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099622 (10yuvipanda) Alright, so that's done. What now?
[09:00:30] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099627 (10scfc) So after the quick merge & deploy by @yuvipanda, they no longer hang, but do not register on `tools-webproxy-01`/`tools-webproxy-02`: ``` scfc@tools-webproxy-02:~$ echo KEYS prefix...
[09:03:19] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099631 (10yuvipanda) Looking at /var/log/proxylistener, they both stop at around the same time with: `2015-03-09 05:30:25,515 Set redis key prefix:xtools-articleinfo with key/value .*:http://tools...
[09:06:11] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099636 (10scfc) Could this be one of these conntrack issues or something similar where we hit a networking limit? I don't know enough about that for debugging, but if there's someone knowledgeable...
[09:10:00] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099638 (10scfc) Looking at `/var/log/syslog`, at Mar 9 06:01:01 the `resolv.conf` patch was deployed.
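As an aside on the route checks at [09:00:30] above: a minimal sketch of how the registered routes could be listed on a proxy host, assuming the proxy keeps them in its local redis under keys named `prefix:<toolname>` (the naming visible in the proxylistener log excerpt at [09:03:19]); the host, port and hash layout here are assumptions, not taken from the log.
```
#!/bin/bash
# Hedged sketch: list web proxy route registrations on a Tool Labs proxy host.
# Assumes routes live in a local redis under "prefix:<toolname>" keys, as the
# proxylistener log line quoted above suggests; host/port/layout are guesses.
REDIS="redis-cli -h 127.0.0.1 -p 6379"

# List every registered tool route.
$REDIS --scan --pattern 'prefix:*' | sort

# Check a single tool, e.g. the missing "persondata" route from the bug report.
$REDIS hgetall prefix:persondata
```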
[09:16:44] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099642 (10scfc) ``` scfc@tools-webproxy-02:~$ sudo lsof -p 1008 | fgrep TCP | wc -l 1019 scfc@tools-webproxy-02:~$ ``` (1008 = proxylistener.) 1024?
[09:18:41] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099643 (10scfc) Looks that way: ``` scfc@tools-webproxy-02:~$ sudo cat /proc/1008/limits Limit Soft Limit Hard Limit Units Max cpu time un...
[09:32:38] 6Labs: webservice2 on magnustools not working - https://phabricator.wikimedia.org/T91952#1099651 (10Magnus) 3NEW
[09:33:47] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099658 (10scfc) @yuvipanda: Do you want to restart the proxylistener service or try to change the limits on the running process?
[09:34:57] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099659 (10yuvipanda) Changing the limits on the running process would be good, if possible. If not, we can do a qmod based restart and then pick up stragglers from the redis key list?
[09:37:37] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099661 (10yuvipanda) Alright, looks like we can't really change the limits on the running process...
[09:38:31] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099662 (10scfc) I just compiled prlimit. Should I give it a try?
[09:38:39] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099663 (10yuvipanda) As a side note, we should find a way to get rid of this persistent connections architecture of proxylistener...
[09:38:48] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099666 (10yuvipanda) Aha! Yes you should
[09:52:31] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099682 (10scfc) Wow, that was troublesome until I finally ended up with: ``` root@tools-webproxy-02:~# LD_PRELOAD=/home/scfc/usr/lib/libsmartcols.so /home/scfc/usr/bin/prlimit -p 1008 --nofile=819...
[10:13:21] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099684 (10scfc) a:3scfc `tools-webproxy-01`'s `proxylistener` also now has 8192 (even has the same pid, had to look twice :-)). `wikihistory` was in the lot that I had to restart because it wasn...
[10:17:51] 10Tool-Labs: Move list of proxies to hiera - https://phabricator.wikimedia.org/T91954#1099702 (10scfc) 3NEW a:3scfc
[10:17:57] 10Tool-Labs, 5Patch-For-Review: Problems with web proxy for Tool Labs - https://phabricator.wikimedia.org/T91939#1099711 (10yuvipanda) @scfc <3
[10:30:53] I can't access the http://tools.wmflabs.org domain from within tools anymore, a command-line browser shows that it points to some devwiki instead
[10:37:10] sitic: hey
[10:37:39] sitic: is this your tornado app?
[10:39:00] no, some botscript. eg "curl http://tools.wmflabs.org/catscan3/" gives a 404 from within tools, but it works in my normal web browser
[10:39:22] sitic: which host were you curling from?
[10:39:32] tools-login
[10:40:14] actually tools-dev
[10:40:26] * sitic sshconfig is messed up
[10:41:13] sitic: looking
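A condensed sketch of the diagnosis and live fix from T91939 above ([09:16]-[09:52]): count the proxylistener's open TCP sockets, inspect its limits, and raise the nofile limit on the running process. PID 1008 and the 8192 target come from the log; the exact commands below are illustrative, and the log shows a prlimit binary had to be compiled locally before this was possible.
```
#!/bin/bash
# Hedged sketch of the live fd-limit fix from T91939; PID 1008 and the 8192
# target come from the log above, the rest is illustrative.
PID=1008   # proxylistener

# How many TCP sockets does the process hold right now?
sudo lsof -p "$PID" | grep -c TCP

# What are its current soft/hard limits?
sudo grep 'Max open files' "/proc/$PID/limits"

# Raise the open-files limit on the *running* process (util-linux prlimit).
sudo prlimit --pid "$PID" --nofile=8192:8192

# Verify.
sudo grep 'Max open files' "/proc/$PID/limits"
```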
[12:00:46] hey. is it possible to increase the storage amount of instances?
[12:01:37] jzerebecki: hey! by default only 20G is allocated, you can use lvm to allocate the rest
[12:01:55] jzerebecki: there is a srv role in wikitech that allocates everything to /srv
[12:02:05] YuviPanda: i mean increase it beyond that
[12:02:24] jzerebecki: right. not atm. we’ve had what, 1 request for this before and I think we turned them down...
[12:02:50] YuviPanda: is there some manual way?
[12:02:59] for you to do it?
[12:03:15] jzerebecki: nope. we have to add a new flavor, and I think there might even be some hardware limitations. andrewbogott_afk would know better
[12:29:16] we should really add cinder to our openstack installation
[12:30:49] jzerebecki: I think andrewbogott_afk was looking into it at some point.
[12:31:07] jzerebecki: part of the trouble is that only andrewbogott_afk is working on the OpenStack infrastructure itself, and currently we’re just trying to get rid of Wikitech / OpenStackManager
[12:34:48] YuviPanda: there was talk about allowing non-openstack-admins to access the nova api (like for cli tools) is that what https://phabricator.wikimedia.org/T49515 is about (which means its project is wrong)?
[12:35:16] so that mere lab project admins could like stop vms or resize them.
[12:35:35] jzerebecki: heh, yeah, that’s still on the cards… at some point in the future?
[12:36:41] is that the correct ticket or should i create a new one?
[12:41:01] jzerebecki: commenting on that would be enough, I think
[12:42:16] thx
[13:06:42] YuviPanda: What do you think about T91806? Passive check?
[13:07:17] YuviPanda: Alternately, we could monitor side effects instead (that the exports have been touched in the last n minutes, say).
[13:16:03] Coren: I… am not sure, actually.
[13:16:57] Coren: so…. one twisted way to do it would be to have ‘boolean’ metrics being reported somewhere about it running, and then test on XOR
[13:17:13] Oh, ew!
[13:17:37] I mean yeah, it'd work, but "ew!"
[13:17:45] right
[13:17:48] can’t think of anything else
[13:18:07] Let's cogitate on it some more.
[13:18:10] yes
[13:18:12] I agree
[13:18:33] Coren: in the meantime, can you puppetize / wrap around the nfs daemon on the hosts so that they won’t run at all, and point to start-nfs?
[13:18:52] * Coren ponders.
[13:19:02] Those upstart/systemd scripts are package-managed.
[13:19:10] right.
[13:19:21] and repackaging would be a PITA
[13:21:15] I think I'm inclined to trust opsen. Perhaps a very in-your-face motd reminder?
[13:21:34] yeah, at least that
[13:22:10] Coren: things slip through. like when andrewbogott_afk was considering restarting the NFS servers after ghost he couldn’t find start-nfs...
[13:22:48] Coren: also, me and scfc_de figured out the proxy issue and fixed it.
[13:26:04] Yeay! I tried for a while by looking at the proxy logs, but Sunday night past bedtime is not conducive to strokes of genius. I knew you'd be there in a couple of hours.
[13:29:09] Coren: https://phabricator.wikimedia.org/T91939
[13:29:15] the proxylistener had run out of open file handles
[13:29:31] so we (mostly scfc_de) performed live surgery and increased the limit without shutting down the process
[13:29:34] Sooo many webservices!
[13:29:48] Coren: thoughts on my comment in https://phabricator.wikimedia.org/T90369
[13:30:33] Coren: ugh, there’s like 20 xtools-articleinfo jobs (https://tools.wmflabs.org/?status)
[13:30:36] can you take a look?
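A rough sketch of the "use lvm to allocate the rest" suggestion from the [12:01] exchange, i.e. handing the instance's unallocated volume-group space to /srv. The volume-group name `vd` is a placeholder and the sequence is an assumption about what the wikitech srv role does, not something taken from the log.
```
#!/bin/bash
# Hedged sketch: give all unallocated LVM space to /srv, roughly what the
# "srv" role mentioned at [12:01:55] is said to do. The volume-group name
# "vd" is a placeholder; use whatever `vgs` reports on the instance.
set -e

sudo vgs                                   # see how much free space the VG has
sudo lvcreate -n srv -l 100%FREE vd        # carve a logical volume from the rest
sudo mkfs.ext4 /dev/vd/srv                 # put a filesystem on it
sudo mkdir -p /srv
echo '/dev/vd/srv /srv ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /srv
df -h /srv
```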
[13:33:07] <^d> YuviPanda: morning bro.
[13:33:13] ^d: hey bro
[13:33:21] * YuviPanda tips over ^d’s motorcycle
[13:33:59] <^d> True story: a drunk in my neighborhood got arrested the other night for kicking over some motorcycle in front of the building next to mine :p
[13:34:48] hahaha
[13:34:56] nicd
[13:34:58] *nice
[13:35:44] YuviPanda: I'm guessing it got cray-cray during the proxy being ill - there's a cronjob that looks for actual proper responses from the webservice and restarts it otherwise; but it has no guard against running-but-not-proxied.
[13:35:55] right
[13:36:21] there’s still stuff running on the tomcat node, for some reason
[13:36:28] why is the enwp10-update running on the tomcat node...
[13:49:03] Coren: can you also respond to paravoid’s questions on https://gerrit.wikimedia.org/r/#/c/194865/
[13:54:56] YuviPanda: Hi!
[13:55:13] YuviPanda: Come back to you about /data/project...
[13:56:10] YuviPanda: Basically it works, in particular in write mode, I don't think mwoffliner is slowed down due to network latencies... at least it's not obvious
[13:56:28] Kelson42: \o/ yay
[13:56:35] YuviPanda: but there is a big problem, at the time you come to zimwriterfs, that means reading all files to make a ZIM file... then it's damned slow
[13:56:51] Kelson42: right. NFS is… a bit slow, to say the least.
[13:57:18] YuviPanda: Is it NFS, or is it the hardware, in your opinion?
[13:57:30] Kelson42: Coren might be able to answer that better
[13:57:41] but NFS in general isn’t going to be as fast as instance storage (/srv and friends)
[13:57:50] YuviPanda: because it's really really slower, it takes me more than a day to make a ZIM of wikispecies (~400.000 articles).
[13:57:59] YuviPanda: Actually, the performance of NFS is within the order of magnitude of the "local" filesystem for most use cases; it's not all that hot at very long contiguous reads because the disk bandwidth needs to be shared.
[13:58:09] right.
[13:58:18] Kelson42: how much time did it take with /srv?
[13:59:13] YuviPanda: a few hours, I'm going to make more accurate benchmarks...
[13:59:28] a few hours to 24h does seem like an order of magnitude.
[13:59:44] Kelson42: have you considered asking the WMF for dedicated hardware?
[14:01:17] YuviPanda: everything is open for now.... we are trying to find a solution with the current setup, I have shared the vision + first feedback on the ML....
[14:01:50] Kelson42: right.
[14:02:04] YuviPanda: I don't know what is the best approach, but maybe putting in a little bit more hardware and having an xx-large instance would be an approach
[14:02:24] Kelson42: right, but do you think you can run enwiki ZIM generation this way?
[14:02:38] YuviPanda: you mean on wmflabs?
[14:02:58] Kelson42: yeah
[14:05:01] YuviPanda: absolutely! For the really big Wikipedias I just need a bigger /srv... for the rest x-large does the job. I have already managed to make WPNL zim files on mwoffliner2.wmflabs.org
[14:07:50] YuviPanda: do you have any plan to make custom VMs (that means choosing mass-storage+RAM+number of cores)?
[14:26:19] Kelson42: not that I know of. file a bug?
[14:27:46] YuviPanda: Yes, I guess the VM backend can deal with all this... probably only an OpenStackManager improvement.
[14:29:14] Kelson42: so what you need is a new ‘flavor’, which is a CPU + RAM + Disk combo
[14:30:05] YuviPanda: IMO, this would help to better allocate resources; for example, I probably don't need more than 8GB of RAM.
[14:30:14] right
[14:30:24] Kelson42: so I suggest filing a bug, and let’s poke andrewbogott and see what he thinks
[14:31:04] YuviPanda: this might be a solution to better allocate the resources of the cluster
[14:31:20] Kelson42: yup, yup. I agree.
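Since more accurate benchmarks are promised at [13:59], a crude NFS-versus-local comparison could be run along these lines; the paths, sizes and target directory name are illustrative only.
```
#!/bin/bash
# Hedged sketch of a rough throughput comparison between local instance
# storage (/srv) and the NFS-backed /data/project, as discussed above.
# Paths and sizes are illustrative; conv=fdatasync makes dd include the
# flush to disk rather than just the page cache.
for target in /srv/benchtest /data/project/mwoffliner-bench; do
    mkdir -p "$target"
    echo "== write: $target =="
    dd if=/dev/zero of="$target/testfile" bs=4M count=256 conv=fdatasync 2>&1 | tail -1
    echo "== read:  $target =="
    dd if="$target/testfile" of=/dev/null bs=4M 2>&1 | tail -1
    rm -f "$target/testfile"
done
```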
[14:57:55] YuviPanda, Kelson42, there’s already a custom flavor with an extra-large drive, I think — let me check. (I have to assign it to projects explicitly)
[14:58:45] andrewbogott: ohoho.... :)
[14:58:56] what specs do you need, specifically?
[14:59:21] (looks like actually the existing custom flavors are faster, not bigger. But I can make a big, not fast one.)
[15:05:03] bd808, Reedy, YuviPanda: jouncebot is missing, please restart?
[15:05:10] oh, alright
[15:05:11] * YuviPanda looks
[15:05:21] andrewbogott: It's a little bit difficult to say, but to mirror the +1.5M Wikipedias, something with x2 storage (so ~300GB) might allow me to do the job
[15:06:00] anomie: rebooted.
[15:06:11] Kelson42: and, how many such instances would you need?
[15:06:37] (and why doesn’t NFS work for you? Is this a db?)
[15:07:10] andrewbogott: NFS slows down the process pretty seriously... otherwise it works
[15:07:44] andrewbogott: NFS multiplies the whole dumping time by at least 3
[15:08:20] andrewbogott: which is still a little bit surprising to me...
[15:09:38] some nova admin please resize wdq-bg1, wdq-bg2, wdq-bg3.eqiad.wmflabs in https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query from m1.large to m1.xlarge
[15:10:38] andrewbogott: maybe it would be better to check if the NFS storage can be sped up, instead of having a bigger /srv....
[15:11:07] Kelson42: 1.5Tb on the NFS server is something we can provide right now. 1.5Tb of local instance storage would require us to budget an additional virt node, which might be OK but will take a while to push through.
[15:11:22] So, yeah, I’d encourage you to research NFS options for the time being, because that’s where the space is.
[15:11:43] (One virt node that would normally host 40-50 labs instances has about 2 Tb of storage)
[15:12:15] Also, "speed up NFS" is not really something that can be done - that thing already has really really /good/ performance but has to be shared fairly between all the projects.
[15:12:17] andrewbogott: yes, I'll deliver a detailed bench.
[15:13:02] Kelson42: You may be able to improve performance, however, by making certain you do not do partial page writes, for instance, with buffering as required.
[15:13:07] Coren: right, NFS can’t be made faster, but Kelson’s tools maybe can, by e.g. making fewer, larger file writes or caching
[15:13:07] Coren: having a more robust setup (better disk redundancy) on the NFS server would not be an option?
[15:13:14] ah, yes, as you say :)
[15:14:11] Coren: I suspect that the problem is not NFS, but the I/O on the disk array behind it.
[15:14:20] Kelson42: ... just so you know, the NFS server has 72 drives over five channels spread on two controllers; arranged in a striping of raid 6 arrays. I doubt you can do more redundancy. :-)
[15:14:53] Coren: sounds like a good argument.
[15:15:34] Just keep in mind that there are a LOT of users to share this with - being careful about your use pattern is the key to winning. :-)
[15:18:07] Don't update files, favour creating new ones. Write in 64K chunks, 4M if you can swing it. Never write less than 4k. Same for reading.
[15:20:24] Coren: there are for sure still a few optims to do, but you can not really change the fact that you have to deal with millions of entries/files for big Wikipedias...
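Coren's advice at [15:18] (create new files rather than updating them, write in large 64K-4M chunks) might translate into something like the following; a sketch under those assumptions, not mwoffliner's actual code.
```
#!/bin/bash
# Hedged sketch of the NFS-friendly write pattern suggested at [15:18:07]:
# write a new file in large aligned chunks, then atomically move it into
# place instead of rewriting the existing file. Paths are illustrative.
SRC=/srv/build/article.html
DEST=/data/project/mwoffliner/out/article.html

TMP="$(mktemp "${DEST}.XXXXXX")"
# bs=4M keeps every write call large and aligned; dd never issues tiny writes.
dd if="$SRC" of="$TMP" bs=4M conv=fdatasync
mv "$TMP" "$DEST"        # new file + rename, rather than updating in place
```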
[15:21:18] some nova admin please resize wdq-bg1, wdq-bg2, wdq-bg3.eqiad.wmflabs in https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query from m1.large to m1.xlarge
[15:21:52] andrewbogott: ^ do we support resizing instances now? Last I remember asking, we didn’t...
[15:21:52] No doubt, but as with many things, one has to balance how much speed you can get vs how much money you can throw at the problem. Giving you this quantity of local disk may be the best solution, but it's one that needs resources allocated to it.
[15:21:54] but that was a year ago
[15:22:15] YuviPanda: no. Openstack claims support for it but I’ve never seen an instance survive the operation.
[15:23:31] Kelson42: So I guess the speed solution in this case is to convince someone who holds a budget that more speed is worth the investment. :-)
[15:24:56] jzerebecki: you’ll need to build new instances and migrate the services. If you need larger quotas in order to make space to do the shift just let me know.
[15:25:29] thx
[15:25:30] Coren: Yes, this is probably part of the solution. Implementing a few SW improvements too. Will try to do both :)
[15:26:23] Kelson42: Look at it from the bright side. Two years ago you'd have had to deal with glusterfs instead. :-)
[15:26:42] Coren: fun graph: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1425914749.503&target=tools.tools-redis.redis.6379.memory.internal_view.value&target=tools.tools-redis.redis.6379.memory.external_view.value&from=00%3A00_20150109&until=23%3A59_20150309
[15:27:09] the only reason redis isn’t dying atm is because I am making it evict things...
[15:27:13] YuviPanda: Redis seems to be popular.
[15:27:14] allkeys-lru rather than volatile-lru
[15:27:18] Coren yes, I do it!
[15:27:28] Coren: yup. I somehow suspect a runaway tool that just doesn’t remember to set ttls, however
[15:27:57] YuviPanda: Ah; and no match from keys to tools. :-( Hm.
[15:28:09] Coren: nope.
[15:28:17] YuviPanda: How much logging does it do atm? Tracking down outliers might be worthwhile.
[15:28:20] Coren: also, we don’t actually have an easy way of seeing what keys are there, even :P
[15:28:25] Coren: we can tail the AOF file
[15:28:51] that’s practically tailing the binlog, so will give us everything
[15:28:57] RECOVERY - Puppet staleness on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [3600.0]
[15:29:37] I mean, redis is explicitly documented and stated to be volatile; so expelling is workable in the short term. But if someone is causing too fast a rotation it's still worthwhile to pursue.
[15:30:08] yup, yup
[15:30:20] I think the interesting data point is volatile-lru not having any effect at all
[15:34:40] Not enough effect or no visible effect?
[15:35:01] Coren: right. no visible effect.
[15:35:10] Coren: so basically I found out when grrrit-wm died
[15:35:25] because redis was at maxmemory and didn’t have anything to kill
[15:35:33] so that basically meant that even after chucking out anything with a ttl
[15:35:36] it’s still not empty
[15:35:36] err
[15:35:38] it’s still full
[15:43:06] Well, that means either most people don't set a ttl, or that a few tools fill all of it with no ttl.
[15:43:13] Coren: yup
[15:43:16] and we need to find out which
[15:43:36] y’know what’ll be hilarious? if we look at the logs and find out that it’s grrrit-wm or wikibugs
[15:44:21] Well, you should be in a position to check that at least. :-)
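To answer the [15:43] question (do most keys lack a TTL, or do a few tools fill everything without one?), a keyspace walk along these lines could help; the host name, sample size and the guess that keys use a colon-separated prefix are all assumptions, not details from the log.
```
#!/bin/bash
# Hedged sketch: look for keys on tools-redis that never expire (TTL == -1)
# and group them by their leading "prefix:" component, to spot a runaway
# tool, per the [15:43] question above. Host name, sample size, and the
# colon-separated key naming are assumptions.
REDIS="redis-cli -h tools-redis"

$REDIS info memory | grep used_memory_human   # how full are we?
$REDIS config get maxmemory-policy            # current eviction policy

# Sample the keyspace and count no-TTL keys by their prefix.
$REDIS --scan | head -n 10000 | while read -r key; do
    if [ "$($REDIS ttl "$key")" = "-1" ]; then
        echo "${key%%:*}"          # keep only the part before the first ':'
    fi
done | sort | uniq -c | sort -rn | head -20
```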
[15:45:52] Coren: also, have you seen https://phabricator.wikimedia.org/T90561
[15:45:59] I’m going to get on that sometime soon
[15:47:34] Sounds about right. The "statelessness" principle I definitely agree with.
[15:53:10] Coren: cool. also while I have you, https://phabricator.wikimedia.org/T90369
[15:54:05] I'm not clear on exactly what the status on gridengine is. It's in *very* active use in the scientific and engineering communities and I don't believe it's going to be orphaned anytime soon.
[15:54:43] Absolutely every bit of software having to do with CFD is written for gridengine, for instance.
[15:55:17] Coren: I don’t think the scientific and engineering communities are a high bar to look at. See also: Fortran
[15:55:45] Fortran is very good at some things in a way that makes it really hard to replace because there are no better alternatives. :-)
[15:56:15] And yes, I spent a few years maintaining exactly this not that long ago (fortran code for CFD running over gridengine) :-)
[15:56:27] heh, do you want everyone to share your pain? :)
[15:56:49] I can live without the fortran for sure. :-)
[15:56:56] I can live without the OGE too :)
[15:57:16] still, this gives us a 3 year runway to replace OGE.
[15:57:19] It's kinda clunky, but it does its job well.
[15:57:26] I would highly dispute the ‘well’
[15:57:32] But yeah, long term indeed.
[15:58:31] qstat -xml outputs very, very obviously malformed XML. There’s no API to properly automate it. The fact that bigbrother needs to exist is a sham. the docs suck. There’s nobody using it anymore outside of people who also run Fortran. Even debian’s gonna kick it out.
[15:59:07] Coren: one of the things I’m doing is carefully following the SOA stuff that’s going on in prod. We might be able to steal some of it.
[16:02:18] Sure, once one of those things matures it may be worthwhile to look at it.
[16:02:34] yup
[16:32:08] YuviPanda / Coren: Busy getting labs in shape? ;-)
[16:32:41] is… it out of shape atm?
[16:33:01] multichill: I spent the last two weeks adding enough redundancy that it’ll survive hardware failure much better :)
[16:33:07] multichill: anything *particularly* out of shape?
[16:33:45] That's nice! Does the platform already support live migration of vm's?
[16:35:35] multichill: live migration sort of works, but there are various issues that make it less-than-reliable. For one thing it only works when the host we’re migrating from isn’t broken :/
[16:36:16] multichill: yeah, so we just added redundancy. tools-webproxy-01 and -02, we tested tools-master and tools-shadow, and added enough extra nodes so that tools will still keep running even if some nodes are down
[16:36:23] redis is still a SPOF tho
[16:36:26] and so is bigbrother...
[16:36:35] Bigbrother should die
[16:36:41] :P
[16:36:51] multichill: totally agree
[16:36:56] Is the load balancer redundant?
[16:36:57] multichill: https://phabricator.wikimedia.org/T90534 is the tracking bug for ‘make toollabs reliable enough’
[16:37:00] guys, don’t say things like that, he’s watching!
[16:37:20] multichill: the webproxy? it is. two machines, hotswappable (manually)
[16:37:41] multichill: that was the first thing I did :)
[16:38:07] multichill: this is the ‘die bigbrother die’ task https://phabricator.wikimedia.org/T90561
[16:39:10] If I were to do a redundancy review of an environment like this I would go bottom up from the infrastructure --> vm's.
[16:39:23] multichill: yup, agreed. new machines are on the way...
[16:39:38] Is the power proper A/B? Storage redundant, network dual connected, etc etc
[16:39:41] multichill: I wrote a long email about how we had one machine as a lifeboat spare...
[16:40:01] multichill: ah, good point. I had just assumed those were all good, and then not thought about them.
[16:40:10] I just got an email from Mark that the new servers in the Netherlands already arrived, but these are not for labs of course
[16:40:10] now that I think about it, everyone hates the ciscos...
[16:40:15] nope
[16:40:37] UCS is good stuff, but you have to properly configure it
[16:41:12] (Cisco UCS that is)
[16:41:21] yeah
[16:41:23] one step at a time :)
[16:41:28] we’re just moving off the ciscos, though.
[16:41:33] bigger faster better HPs
[16:42:15] We work in generations. The current one is Cisco UCS, maybe next year something different
[16:43:27] btw, YuviPanda, we have approved budget to replace all the ciscos. Getting quotes today.
[16:44:02] it’ll take a few weeks before we have hardware though
[16:44:53] YuviPanda: Would be cool if tool owners could set up a Nagios check for their tool and get notifications
[16:45:03] andrewbogott: yup, saw. yay
[16:45:14] multichill: yup, that’s part of the ‘kill bigbrother’ thing
[16:45:29] Of course with proper parent -> child relationships so that if the webserver goes down, we don't get spammed
[16:45:48] yup
[16:46:24] multichill: I have a feeling toollabs stability will be an ops/labs goal for next quarter.
[16:48:14] +1
[16:50:12] multichill: anyway, file more bugs as children of that tracking bug :) You’ve been a ‘user’ of both tools / toolserver, so I would appreciate the perspective
[16:53:10] Btw, did you have a look at wdq YuviPanda? Seems to be stalled again
[16:53:32] Lost track of the different bugs.....
[17:24:28] !log staging created staging-tin, setting up puppet and salt keys
[17:24:32] Logged the message, Master
[17:24:46] multichill: it’s not been doing that properly for a while now. No idea what’s causing it. probably a C++ bug somewhere
[18:08:21] !bigbrother
[18:27:55] how often does bigbrother check?
[18:28:05] I just killed my webservice to test that I set it up correctly
[18:32:37] Coren: ^ ?
[18:32:51] legoktm: Every 60s IIRC
[18:33:10] hmm
[18:33:26] tools.checker@tools-trusty:~$ cat ~/.bigbrotherrc
[18:33:26] webservice
[18:33:27] legoktm: But: it will only restart a service if it saw it running
[18:33:30] doesn't seem to have restarted
[18:33:31] oh
[18:33:44] I.e.: if you stop, then add .bigbrother, it won't work until you start it manually once.
[18:33:52] gotcha
[18:34:05] I restarted it and will cross my fingers that it works when it goes down :P
[18:37:47] !bigbrother is https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Grid#Bigbrother
[18:37:47] Key was added
[18:38:24] legoktm: You can also try it by stopping it by hand once it's been seen as running.
[18:42:08] Coren: seen https://phabricator.wikimedia.org/T90800#1099591?
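For context on the [18:27]-[18:38] exchange, bigbrother only restarts a webservice it has already seen running. A stripped-down watchdog with the same semantics might look like this; the state file, the grep on the job name and the `webservice restart` call are illustrative assumptions, not bigbrother's actual implementation.
```
#!/bin/bash
# Hedged sketch of bigbrother's restart semantics as described at [18:33:27]:
# only restart the webservice if we have previously seen it running.
# The state file and the qstat/webservice usage here are illustrative.
STATE=~/.seen-webservice-running

if qstat | grep -q lighttpd; then         # webservice job is on the grid
    touch "$STATE"                         # remember that we saw it alive
elif [ -e "$STATE" ]; then
    echo "$(date -u +%FT%TZ) webservice gone, restarting" >> ~/watchdog.log
    webservice restart
fi
```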
[19:58:30] (03PS1) 10Yuvipanda: add dummy private key for l10nupdate [labs/private] - 10https://gerrit.wikimedia.org/r/195351
[19:58:49] (03CR) 10Yuvipanda: [C: 032 V: 032] add dummy private key for l10nupdate [labs/private] - 10https://gerrit.wikimedia.org/r/195351 (owner: 10Yuvipanda)
[20:03:31] (03PS1) 10Yuvipanda: Add l10nupdate dummy public key as well [labs/private] - 10https://gerrit.wikimedia.org/r/195354
[20:03:49] (03CR) 10Yuvipanda: [C: 032 V: 032] Add l10nupdate dummy public key as well [labs/private] - 10https://gerrit.wikimedia.org/r/195354 (owner: 10Yuvipanda)
[20:17:40] 2015-03-03 22:31:45,290 - irc3.wikibugs - DEBUG - > QUIT :INT
[20:17:41] nice
[20:18:14] !log tools.wikibugs legoktm: Deployed 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt wb2-irc
[20:18:16] Logged the message, Master
[20:24:12] hi wikibugs
[20:24:32] 10Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1101296 (10Legoktm) !
[20:26:47] !log tools.wikibugs Updated channels.yaml to: 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt
[20:26:48] 10Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1101300 (10Legoktm) .
[20:26:50] Logged the message, Master
[20:27:02] 10Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1101301 (10Legoktm) ...
[20:36:06] !log tools.wikibugs Updated channels.yaml to: 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt
[20:36:10] Logged the message, Master
[20:50:02] 6Labs, 10hardware-requests, 6operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1101384 (10RobH) @andrew, Is there an actual hardware issue with virt1000? Just being out of warranty is typically not enough to sunset use of a system; and is usually coupled...
[21:01:49] 6Labs, 10hardware-requests, 6operations: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1101461 (10RobH) You can grep over https://wikitech.wikimedia.org/wiki/Server_Spares and advise which would suit your needs.
[21:06:08] 6Labs, 10hardware-requests, 6operations: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1101484 (10Andrew) A single w/16Gb would be great. I'm not sure what to do about networking, though -- it needs to talk to labs servers on rabbitmq but also (in theory) run a public DNS server. Hm.
[21:11:05] hi, I'm on the tools-login server, and when I curl http://tools.wmflabs.org or http://tools-webproxy I get a 301. This started happening around 11:00 GMT yesterday, 8 May
[21:11:16] when I curl those URLs on my own machine I get a 200
[21:11:31] I can't figure out why
[21:11:32] !log tools.wikibugs legoktm: Deployed 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt wb2-irc
[21:11:34] Logged the message, Master
[21:11:42] bah
[21:11:43] stupid
[21:12:28] e.g. try: curl -I -s -m 20 "http://tools-webproxy" | grep HTTP
[21:12:40] or: curl -I -s -m 20 "http://tools.wmflabs.org" | grep HTTP
[21:12:55] that should be a 200
[21:14:35] legoktm Coren any ideas?
[21:14:51] no idea
[21:15:32] 10Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1101540 (10Legoktm) ?
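The one-liners quoted at [21:12] are easy to wrap into a small check; the URL list and the "anything but 200 is a failure" rule come from that exchange, the exit code and output format are assumptions.
```
#!/bin/bash
# Hedged sketch wrapping the curl checks quoted at [21:12:28]/[21:12:40]:
# from inside the project, both URLs should answer 200; anything else is
# treated as a failure.
fail=0
for url in "http://tools-webproxy" "http://tools.wmflabs.org"; do
    status=$(curl -I -s -m 20 -o /dev/null -w '%{http_code}' "$url")
    echo "$url -> $status"
    if [ "$status" != "200" ]; then
        fail=1
    fi
done
exit "$fail"
```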
[21:16:17] 6Labs, 10Wikimedia-Labs-wikitech-interface: Keystone tokens truncated when wikitech stores them - https://phabricator.wikimedia.org/T92014#1101542 (10Andrew) 3NEW a:3Andrew
[21:20:54] 6Labs, 10Wikimedia-Labs-wikitech-interface: Keystone tokens truncated when wikitech stores them - https://phabricator.wikimedia.org/T92014#1101586 (10Andrew) The code that writes is here: https://github.com/wikimedia/mediawiki-extensions-OpenStackManager/blob/master/nova/OpenStackNovaUser.php#L105 Probably t...
[21:21:09] 6Labs, 10Wikimedia-Labs-wikitech-interface: Keystone tokens truncated when wikitech stores them - https://phabricator.wikimedia.org/T92014#1101588 (10Andrew) a:5Andrew>3Springle
[21:21:10] hi legoktm, any reason why https://tools.wmflabs.org/globalprefs/ is not running? I recall that this was probably the OAuth service which lets you set your global interface language.
[21:22:24] se4598: running now :)
[21:22:47] thanks!
[21:23:29] iirc that tool is a giant mess and needs refactoring :/
[21:24:37] eh, it's still using fcgi :|
[21:32:10] 6Labs, 10Wikimedia-Labs-wikitech-interface: Keystone tokens truncated when wikitech stores them - https://phabricator.wikimedia.org/T92014#1101686 (10Andrew)
[21:32:42] Coren: possible culprit (or at least partial culprit) for “Have you tried turning it off and back on again?” wikitech bug ^
[21:34:51] Also I’ve learned a bit about why it behaves so poorly… there’s not really any case where we can tell the difference between a bad token (which that bug produces) and a good-but-you-aren’t-in-this-project token.
[21:43:46] That would be... ugh.
[21:45:20] But wait; what would cause a load exactly?
[21:47:37] Coren: did you happen to see my question above? curl -I http://tools.wmflabs.org returns a 301 instead of a 200 on the login server, starting at around 11:00 GMT yesterday. Any ideas what that could be?
[21:48:36] $ curl -I http://tools.wmflabs.org/
[21:48:36] HTTP/1.1 200 OK
[21:48:50] did you try that on the login server? I get a 200 on my local
[21:49:46] That's... downright daft.
[21:49:48] Huh.
[21:50:01] if I try to hit a tool in particular, like curl -I http://tools.wmflabs.org/xtools it returns a 404
[21:50:35] Oh d'oh.
[21:50:45] we have shell scripts that automatically restart the service based on bad responses, they don't work anymore :(
[21:50:45] Side effect of YuviPanda|zzz fiddling with the proxies.
[21:51:54] I figured it was proxy something or another. Do you know how to fix it or should I wait for YuviPanda|zzz to wake up and ask them?
[21:52:10] Fixing
[21:52:21] yay!
[21:54:59] back in working order. Thanks Coren!
[22:05:05] Coren: a load would happen if the token expired in memcache, or we cleared memc for some reason. the other half of the puzzle is https://gerrit.wikimedia.org/r/#/c/195350/
[22:05:45] So I suspect that some upgrade of keystone increased the token size, causing things to break, causing me to fix it in memc — the wrong place for a fix and also only fixed for certain cases.
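One way to sanity-check the truncation theory from [21:32]-[22:05] would be to compare stored token lengths against the column definition; the table and column names (openstack_tokens.token) come from the [22:53:08] comment further down, while the database name and access method are placeholders.
```
#!/bin/bash
# Hedged sketch: check whether wikitech-stored keystone tokens are hitting the
# column limit. Table/column names are taken from the [22:53:08] comment below;
# DBNAME is a placeholder, not the actual wikitech/OpenStackManager database.
DBNAME=wikitech

sudo mysql "$DBNAME" -e "SHOW COLUMNS FROM openstack_tokens LIKE 'token';"
sudo mysql "$DBNAME" -e \
  "SELECT MAX(LENGTH(token)) AS longest_stored_token FROM openstack_tokens;"
```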
[22:11:20] 6Labs, 10hardware-requests, 6operations: eqiad: (5) virt nodes - https://phabricator.wikimedia.org/T89752#1101968 (10RobH) Quotes are back in and escalated to @mark for approvals process on https://rt.wikimedia.org/Ticket/Display.html?id=9249
[22:21:45] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:22:50] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[22:23:20] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[22:23:26] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:24:54] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[22:24:54] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0]
[22:26:46] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[22:27:08] PROBLEM - Puppet failure on tools-exec-11 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0]
[22:27:18] PROBLEM - Puppet failure on tools-exec-03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:27:28] PROBLEM - Puppet failure on tools-exec-06 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0]
[22:31:51] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:31:51] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:31:57] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[22:32:25] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[22:32:35] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:32:49] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[22:33:20] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:34:32] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[22:35:33] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[22:35:47] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:36:55] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:36:55] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[22:36:57] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[22:37:07] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:37:07] PROBLEM - Puppet failure on tools-exec-13 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[22:38:17] PROBLEM - Puppet failure on tools-webgrid-generic-02 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[22:38:21] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[22:39:03] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[22:39:33] PROBLEM - Puppet failure on tools-exec-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:39:53] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:39:54] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[22:42:02] PROBLEM - Puppet failure on tools-webgrid-04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[22:53:08] 6Labs, 10Wikimedia-Labs-wikitech-interface: Keystone tokens truncated when wikitech stores them - https://phabricator.wikimedia.org/T92014#1102115 (10Springle) I've altered openstack_tokens.token to varbinary(2048). Presumably we also need to patch the extension? https://git.wikimedia.org/blob/mediawiki%2Fext...
[22:55:09] 10Tool-Labs: webservice2 failing to start Python web services with xml.etree.ElementTree.ParseError - https://phabricator.wikimedia.org/T92039#1102116 (10Mattflaschen) 3NEW
[22:57:17] 6Labs, 10Wikimedia-Labs-wikitech-interface, 5Patch-For-Review: Keystone tokens truncated when wikitech stores them - https://phabricator.wikimedia.org/T92014#1102126 (10Andrew) Probably -- is that as simple as this? https://gerrit.wikimedia.org/r/#/c/195472/
[23:01:13] * andrewbogott is fixing all of those shinken warnings
[23:17:33] RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0]
[23:17:48] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[23:20:40] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0]
[23:21:40] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:21:54] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:22:02] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0]
[23:22:10] RECOVERY - Puppet failure on tools-exec-13 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:23:20] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:23:22] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:24:15] At this point shinken-wm should really just say “…and so on."
[23:24:38] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:24:38] RECOVERY - Puppet failure on tools-exec-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:25:46] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[23:26:54] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:27:05] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:27:05] RECOVERY - Puppet failure on tools-webgrid-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:27:05] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0]
[23:28:25] RECOVERY - Puppet failure on tools-webgrid-generic-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:29:11] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:29:53] RECOVERY - Puppet failure on tools-exec-catscan is OK: OK: Less than 1.00% above the threshold [0.0]
[23:29:53] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0]
[23:31:37] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:32:57] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:33:05] Coren: you around?
[23:33:19] RECOVERY - Puppet failure on tools-exec-10 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:33:21] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0]
[23:34:55] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:34:56] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[23:36:47] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[23:36:47] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:09] RECOVERY - Puppet failure on tools-exec-11 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:21] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:21] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:31] RECOVERY - Puppet failure on tools-exec-06 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:50:42] 10Tool-Labs: Missing page revisions on enwiki - https://phabricator.wikimedia.org/T74226#1102341 (10Earwig) 5stalled>3Resolved Interesting. It seems like the problem has been resolved, although 16 pages still result from the above query. However, that looks like accurate replication of a corrupted database r...
[23:50:43] 10Wikimedia-Labs-Other, 6operations, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1102345 (10Earwig)