[00:00:00] nope [00:00:04] ah [00:00:06] so it did? [00:00:12] yuvipanda: heh, you op here, so just do it :O [00:00:23] yuvipanda: nope, that was more than half an hour ago [00:00:29] he left voluntary [00:00:39] but he is banned now [00:00:47] (at all wikimedia chans) [00:02:24] ah right [00:02:24] ok :D [00:02:56] wm-bot can do that. [00:03:04] It needs to be enabled though. [00:03:16] @op-toolson [00:03:27] @op-tools-on [00:03:27] It's something like that.. [00:03:34] lemme look [00:03:38] I know: add, changepass, channel-info, channellist, commands, configure, drop, github-, github+, github-off, github-on, grant, grantrole, help, info, instance, join, language, notify, optools-off, optools-on, optools-permanent-off, optools-permanent-on, part, rc-ping, rc-restart, reauth, recentchanges-off, recentchanges-on, reload, restart, revoke, revokerole, seen, seen-host, seen-off, seen-on, seenrx, suppress-off, suppress-on, systeminfo, system-rm, time, traffic-off, traffic-on, translate, trustadd, trustdel, trusted, uptime, verbosity--, verbosity++, wd, whoami [00:03:38] @commands [00:03:51] Operator tools were already enabled on this channel [00:03:51] @optools-on [00:03:56] I trust: .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@wikimedia/Ryan-lane (2admin), .*@wikipedia/.* (2trusted), .*@nightshade.toolserver.org (2trusted), .*@wikimedia/Krinkle (2admin), .*@[Ww]ikimedia/.* (2trusted), .*@wikipedia/Cyberpower678 (2admin), .*@wirenat2\.strw\.leidenuniv\.nl (2trusted), .*@unaffiliated/valhallasw (2trusted), .*@mediawiki/yuvipanda (2admin), .*@wikipedia/Coren (2admin), [00:03:56] @trusted [00:04:00] OK. :D [00:04:26] It's already enabled. [00:04:41] If it wasn't, then I would have enabled it. 
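[Editor's note] The @trusted output above is a list of hostmask regexes, each tagged with a role. A small sketch of how such a trust list could be evaluated (this is illustrative only, not wm-bot's actual code; patterns are taken from the list above):

```python
# Sketch (not wm-bot's real implementation): match an IRC user's hostmask
# against a trust list of regexes like the one @trusted prints above.
import re

TRUSTED = [
    (r".*@wikimedia/.*", "trusted"),
    (r".*@wikipedia/Cyberpower678", "admin"),
    (r".*@mediawiki/yuvipanda", "admin"),
]

def access_level(hostmask):
    """Return 'admin' if any admin pattern matches, else the first matching role."""
    best = None
    for pattern, role in TRUSTED:
        if re.fullmatch(pattern, hostmask):
            if role == "admin":
                return "admin"
            best = best or role
    return best

print(access_level("nick@wikipedia/Cyberpower678"))  # admin
print(access_level("nick@wikimedia/SomeUser"))       # trusted
```

Global admins are checked elsewhere, which is consistent with tom29739 later not appearing in this list at all.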
[00:04:56] yep, but only ryan, krinkle, cyberpower, yuvi or coren can use optools [00:04:57] The command is @kickban [00:05:06] kb is enough I think [00:05:19] kickban is not listed at meta [00:05:25] *or [00:05:31] @kickban [00:05:50] @kb [00:06:08] guess he blocks, because the parameters are not set [00:06:11] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410687 (10zhuyifei1999) >>! In T136712#2410605, @tom29739 wrote: > That's another set of results, but note that the version of pip on my tool is pip 1.5.4 when @zhuyifei1999's results were with pip 8.1.2, so thi... [00:17:07] Luke081515, I can use optools too, but I don't unless it's absolutely necessary. [00:18:07] hm, how? actually you're not on the wm-bot ACL [00:18:17] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410706 (10tom29739) >>! In T136712#2410687, @zhuyifei1999 wrote: >>>! In T136712#2410605, @tom29739 wrote: >> That's another set of results, but note that the version of pip on my tool is pip 1.5.4 when @zhuyife... [00:18:43] You are admin identified by name .*@wikipedia/tom29739 [00:18:43] @whoami [00:18:52] Luke081515, ^ [00:19:16] hm, why are you not on the list from @trusted? [00:19:25] I'm a global admin on wm-bot, it won't show up on that list. [00:19:30] ah [00:21:28] I don't use optools though, because I feel it undermines the channel admins. If I were to go around changing channel configs, kicking/banning users whenever I felt like it, then I'd get very unpopular, very quickly. :D [00:21:58] I'd get my powers taken off me too most probably. [00:22:32] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410709 (10yuvipanda) On running strace for pip -V (which took about 20s), I get: ``` $ strace -o /tmp/pip-strace-counts -f -c pip -V % time seconds usecs/call calls errors syscall ------ -----------...
[00:40:47] I trust: .*@wikimedia/.* (2trusted), .*@mediawiki/.* (2trusted), .*@wikimedia/Ryan-lane (2admin), .*@wikipedia/.* (2trusted), .*@nightshade.toolserver.org (2trusted), .*@wikimedia/Krinkle (2admin), .*@[Ww]ikimedia/.* (2trusted), .*@wikipedia/Cyberpower678 (2admin), .*@wirenat2\.strw\.leidenuniv\.nl (2trusted), .*@unaffiliated/valhallasw (2trusted), .*@mediawiki/yuvipanda (2admin), .*@wikipedia/Coren (2admin), [00:40:47] @trusted [00:45:10] tom29739: I still remember the request of the second wm-bot global admin "P [00:45:15] *:P [04:32:53] 10Tool-Labs-tools-Other, 07I18n: [[Wikimedia:Video2commons-loading]] i18n issue - https://phabricator.wikimedia.org/T138805#2410835 (10Liuxinyu970226) [05:02:34] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/UkrFace was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=692508 edit summary: [06:34:40] 10Tool-Labs-tools-Other, 07I18n: [[Wikimedia:Video2commons-loading]] i18n issue - https://phabricator.wikimedia.org/T138805#2410901 (10zhuyifei1999) a:03zhuyifei1999 The word "loading" in loading screens being all caps isn't uncommon. Please justify it being rude. [06:37:55] 10Tool-Labs-tools-Other, 07I18n: [[Wikimedia:Video2commons-loading]] i18n issue - https://phabricator.wikimedia.org/T138805#2410907 (10Liuxinyu970226) >>! In T138805#2410901, @zhuyifei1999 wrote: > The word "loading" in loading screens being all caps isn't uncommon. Please justify it being rude. Can't you ` 10Tool-Labs-tools-Other, 07I18n: [[Wikimedia:Video2commons-loading]] i18n issue - https://phabricator.wikimedia.org/T138805#2410915 (10zhuyifei1999) {F4207939} Doesn't look normal, does it? [07:29:20] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410950 (10yuvipanda) While it seems terrible that it's doing ~4k open calls for a -V, I guess the problem is just 'NFS has latency'? 
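[Editor's note] The ~4k open() calls for a bare `pip -V` are unsurprising for Python: for every import, the path-based finder probes each sys.path entry for several candidate file names. A simplified sketch of the shape of the problem (the real finder also checks `__pycache__`, versioned `.so` tags, etc.):

```python
# Why a trivial command can hit the filesystem thousands of times: with N
# sys.path entries and M imports, module search is O(N * M) lookups -- and
# on an NFS mount with lookupcache=none, each one is a server round trip.
import os
import sys

def candidate_probes(module_name, search_path):
    """Paths the import system would have to check for one module (simplified)."""
    probes = []
    for entry in search_path:
        probes.append(os.path.join(entry, module_name))          # package directory
        probes.append(os.path.join(entry, module_name + ".py"))  # source module
        probes.append(os.path.join(entry, module_name + ".so"))  # C extension
    return probes

# pip imports hundreds of modules; even ~200 imports over a 20-entry
# sys.path is already thousands of filesystem probes.
print(len(candidate_probes("flask", ["/usr/lib/python2.7", "/srv/venv/lib"])))  # 6
```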
[07:40:20] 10Tool-Labs-tools-Other, 07I18n: [[Wikimedia:Video2commons-loading]] i18n issue - https://phabricator.wikimedia.org/T138805#2410952 (10zhuyifei1999) 05Open>03Resolved Considering [[https://translatewiki.net/wiki/Template:Identical/Loading|Template:Identical/Loading]], I've changed the message to "Loading..... [07:44:19] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410956 (10zhuyifei1999) I remember it was much faster in 2013. [07:45:21] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410957 (10yuvipanda) Indeed, or even a few months ago. The only significant thing I can think of as having changed since then is the NFS rate limiting, so I'd still like to maybe completely remove that and try t... [07:49:18] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410958 (10valhallasw) Yes, and 'we don't cache anything'. See the task description for a comparison between `tools-bastion-03` and `relic`. Your data suggests that an open() call only takes 350µs (which is reall... [07:50:26] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410959 (10tom29739) It was much faster about 5-6 months ago, when I joined. [07:53:41] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410960 (10tom29739) >>! In T136712#2410957, @yuvipanda wrote: > Indeed, or even a few months ago. The only significant thing I can think of as having changed since then is the NFS rate limiting, so I'd still lik... [08:01:18] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410961 (10zhuyifei1999) >>! In T136712#2410959, @tom29739 wrote: > NFS was much faster about 5-6 months ago, when I joined. It's having an impact on the speed of other things too (e.g. the speed of webservices,... 
[08:02:16] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2410963 (10valhallasw) >>! In T136712#2410960, @tom29739 wrote: >>>! In T136712#2410957, @yuvipanda wrote: >> Indeed, or even a few months ago. The only significant thing I can think of as having changed since th... [08:06:57] yuvipanda: is it not possible to cache NFS data or something? [08:08:00] or lazy-load nfs [08:41:30] zhuyifei1999_ https://phabricator.wikimedia.org/T106170 is why we turned off caching, although I agree that ticket needs a re-look [08:42:40] k [08:44:02] turning off cache doesn't seem to fix it though [08:45:54] yeah [08:46:42] one thing we could try is to spin up another bastion host without that turned off and see how that fares, but there's a moratorium on new instances until we get the new virt nodes racked [08:49:59] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: graphite.wmflabs.org is very slow / flaky - https://phabricator.wikimedia.org/T127957#2410994 (10yuvipanda) [08:50:01] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Disable diamond metrics for most labs projects temporarily - https://phabricator.wikimedia.org/T137753#2410992 (10yuvipanda) 05Open>03Resolved This has been re-enabled with new disks. [09:45:45] 06Labs, 10Horizon: Clicking on a project name logs you out of Horizon - https://phabricator.wikimedia.org/T138809#2411037 (10Tgr) [09:49:30] 06Labs, 10Horizon: Clicking on a project name logs you out of Horizon - https://phabricator.wikimedia.org/T138809#2411060 (10Tgr) [12:31:56] one thing about the nfs 'ratelimiting', it's not actaully ratelimiting as in every user has n throughput, it's more like a cap each host can use. So assuming activity on a host is under the throughput threshold there should be no difference in behavior. i.e. 
it doesn't matter until saturation [12:33:38] so it would be unlikely that virtualenv as a thing would be affected universally by this mechanism, metadata caching issues seem more likely and unrelated to tc [13:15:10] 06Labs, 10Horizon: Clicking on a project name in Identity logs you out of Horizon - https://phabricator.wikimedia.org/T138809#2411317 (10Krenair) [13:48:58] chasemp hmm, the reason I suspect it is timing, since I remember virtualenvs working fine after the metadata caching change... [13:50:10] I'm not saying it definitely could never be that, but it doesn't make sense to me either [13:50:22] yuvipanda: drop onto bastion-10 and disable tc [13:50:26] and give it a go [13:50:29] or let me know what exactly to try [13:52:11] chasemp let me write full repro steps first [13:53:34] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2411397 (10yuvipanda) Exact repro steps for me are, on any bastion host: ``` become lolrrit-wm time ./testing-virtualenv/bin/pip -V ``` [13:53:52] chasemp how do I disable tc? [13:54:47] I'm reading through man tc just now [13:57:19] yuvipanda: I think [13:57:31] tc qdisc del dev eth0 root [13:57:32] tc qdisc del dev eth0 handle ffff: ingress [13:57:32] tc [13:57:34] qdisc del dev ifb0 root [13:57:52] ok [13:58:09] doing those on bastion-10 [13:58:35] chasemp how do I put the limits back on? [13:58:47] run the tc script in /usr/local/sbin or puppet would do it [13:59:03] that script essentially does that and then puts in limits [13:59:11] so no matter where you start from you end up at x [13:59:16] i.e. sort of idempotency [14:00:10] ah ok [14:00:35] ok tools-bastion-10 is in some strange state, puppet disabled and 'sudo: unable to initialize policy plugin' [14:00:42] so I'm going to try it on bastion-03 directly...
[14:01:01] since we have a way of getting it back [14:02:59] Sure but on 10 no idea why and nothing there needs to be preserved [14:03:21] my laptop froze here so on phone trying to deal [14:03:28] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2411426 (10yuvipanda) Ok, with @chasemp's help I cleared the tc settings on bastion-03: ``` tc qdisc del dev eth0 root tc qdisc del dev eth0 handle ffff: ingress tc qdisc del dev ifb0 root ``` running pip -V st... [14:03:31] ah ok [14:03:52] chasemp so you're right and I'll have to look at attribute cache [14:04:02] chasemp ok I'll run puppet on -10 now [14:08:42] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2411442 (10tom29739) Perhaps ulimits are causing it? ``` tom29739@tools-bastion-03:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority... [14:16:19] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2411449 (10tom29739) >>! In T136712#2410709, @yuvipanda wrote: > On running strace for pip -V (which took about 20s), I get: > > ``` > $ strace -o /tmp/pip-strace-counts -f -c pip -V > % time seconds usecs/... [14:17:48] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2411450 (10chasemp) best way to eliminate is to edit /etc/security/limits.conf, logout,login, verify nee limits and test [14:19:27] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2411455 (10yuvipanda) No effect. ``` tools.lolrrit-wm@tools-bastion-03:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 fi... 
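[Editor's note] Stepping back from the ulimit dead end: the numbers quoted earlier in the log (~20 s wall time, ~4,000 open() calls) already point at per-lookup network latency rather than any single misconfigured limit. A back-of-envelope check using only those reported figures:

```python
# Back-of-envelope from the strace numbers quoted in the task comments above
# (~20 s wall time, ~4000 open() calls). This is an illustration, not a
# measurement: it derives the average wall-clock cost per lookup, which is
# dominated by NFS round trips rather than CPU or in-kernel syscall time.
wall_seconds = 20.0
open_calls = 4000
per_call_ms = wall_seconds / open_calls * 1000
print(per_call_ms)  # 5.0 -- milliseconds per lookup, on average
```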
[14:54:53] 06Labs, 10Tool-Labs: Virtualenvs slow on tool labs NFS - https://phabricator.wikimedia.org/T136712#2411483 (10yuvipanda) I finally fixed up the test bastion host enough, and @valhallasw is right as usual - turning off lookupcache=none speeds this up to 2s. Removing nofsc had no effect though. [15:03:36] yuvipanda: what's lookupcache? [15:04:34] zhuyifei1999_: http://linux.die.net/man/5/nfs ctrl-f lookupcache [15:04:42] zhuyifei1999_ ctrl-f 'lookupcache' in http://man7.org/linux/man-pages/man5/nfs.5.html [15:04:51] thx [15:04:57] basically it caches stat and similar calls I think, but that might mean if they are changed elsewhere you might not see that immediately [15:11:36] 06Labs, 10Tool-Labs, 10DBA, 06Discovery, 06Maps: Some users use persistent connections that are idle, wasting memory and other resources that could be used for other users - https://phabricator.wikimedia.org/T138283#2411934 (10PeterBowman) I've disabled connection pooling by enforcing the number of idle... [15:25:46] !log tools Signed client cert for tools-worker-1019.tools.eqiad.wmflabs on tools-puppetmaster-01.tools.eqiad.wmflabs [15:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [15:26:46] bd808 uh, I thought that node never came up. 
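[Editor's note] For reference, the nfs(5) client options under discussion, paraphrased from the man pages linked above; the example mount line mirrors the one quoted later in this log (export and mountpoint are from the log, the option set shown is the hypothetical "fixed" variant):

```
# nfs(5) name-lookup caching (directory entry / dentry cache):
#   lookupcache=all       cache positive and negative lookups (the default)
#   lookupcache=positive  cache successful lookups only; recheck misses
#   lookupcache=none      revalidate every lookup with the server (slow, coherent)
# Separately, actimeo=0 disables the *attribute* cache (noac is actimeo=0
# plus sync) -- which is why the actimeo=0 experiment below got slower.
labstore.svc.eqiad.wmnet:/project/tools/project /data/project nfs rw,noatime,vers=4,hard,lookupcache=positive 0 0
```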
[15:26:55] apparently it finally did [15:27:11] let me pool that then [15:27:41] I got an email about failed puppet run from it ~7 hours ago [15:27:50] forced run after signing cert [15:28:53] bd808 btw I think I moved all your webservices to k8s [15:29:07] neat [15:30:03] 06Labs, 10Tool-Labs, 10DBA, 07Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#2412016 (10jcrespo) [15:30:06] 06Labs, 10Tool-Labs, 10DBA, 06Discovery, 06Maps: Some users use persistent connections that are idle, wasting memory and other resources that could be used for other users - https://phabricator.wikimedia.org/T138283#2412013 (10jcrespo) 05Open>03Resolved a:03jcrespo @PeterBowman Thank you for taking... [15:36:08] yuvipanda valhallasw`cloud: what if actimeo=0,lookupcache=positive ? [15:37:14] (or with actimeo=1) [15:39:04] we could try that [15:39:06] let me try it [15:41:19] zhuyifei1999_ (IRC): 34s for that! [15:41:54] that's actimeo=0? :/ [15:42:05] We want it faster, not slower :D [15:42:29] yeah [15:42:40] labstore.svc.eqiad.wmnet:/project/tools/project on /data/project type nfs (rw,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,actimeo=0,lookupcache=positive,nofsc,hard,addr=10.64.37.10,clientaddr=10.68.23.98) [15:42:51] I guess that means actimeo=0 == noac [15:43:03] Why don't you get rid of the lookupcache option and find another way to solve the attribute cache problem? [15:43:27] tom29739: how exactly? [15:43:42] That option probably won't be the only way to solve the problem/ [15:43:58] yuvipanda: what if it's 1? http://www.sebastien-han.fr/blog/2012/12/18/noac-performance-impact-on-web-applications/ says 1 is faster [15:44:41] 'Using the noac option provides greater cache coherence among NFS clients accessing the same files, but it extracts a significant performance penalty. 
As such, judicious use of file locking is encouraged instead. The DATA AND METADATA COHERENCE section contains a detailed discussion of these trade-offs.' [15:44:49] From http://linux.die.net/man/5/nfs [15:45:23] also are most stat() calls failing? they might fall into the "negative" category [15:45:36] That might make it slower. [15:46:09] tom29739: yes noac is slower than the current setup [15:46:22] even man page says so [15:47:01] lock/nolock - Selects whether to use the NLM sideband protocol to lock files on the server. If neither option is specified (or if lock is specified), NLM locking is used for this mount point. When using the nolock option, applications can lock files, but such locks provide exclusion only against other applications running on the same client. Remote applications are not affected by these locks. [15:47:44] Maybe set lock and get rid of the lookupcache option? [15:48:59] lock? [15:49:08] lock=1 [15:49:22] what does that have to do with attribute cache? [15:50:00] noac is making the command slow. [15:50:23] The man page suggests 'As such, judicious use of file locking is encouraged instead' [15:52:07] tom29739 I don't see 'lock' in http://man7.org/linux/man-pages/man5/nfs.5.html [15:52:21] stat'ing across NFS is making it slow. There are really not many good ways to fix that except adding more bandwidth to the NFS server and tuning IOPS on the NFS server itself. [15:52:23] bah nvm [15:52:29] but I don't think pip use locks? [15:53:14] well, if I set lock it has same performance as not specifying lookupcache=none [15:53:16] bd808: yes, but there should be some way to reduce the % of the stats that actually go to the NFS [15:53:23] I think that doesn't matter...
[15:53:44] It looks like lock is only on NFS 2 + 3 [15:54:32] and the larger the %, the slower it is, the less likely to have attribute cache issues [15:54:33] zhuyifei1999_: only for repeat stat calls that have already been cached locally or by making fewer stat calls in the first place [15:54:48] ' stat'ing across NFS is making it slow. There are really not many good ways to fix that except adding more bandwidth to the NFS server and tuning IOPS on the NFS server itself.' - the lookupcache=none option makes more network requests to the NFS server. [15:55:49] yes, but lookupcache=none => (nearly ?) always revalidate [15:56:44] Analyse the network requests made when lookupcache=none vs no lookupcache option maybe? [15:57:53] (and I wonder why precise doesn't have this issue; I neither experienced nfs slowness nor attribute cache issue during the pmtpa time) [15:57:53] 9s on jessie (vs 16s) [15:58:15] wow [15:58:27] the way NFS caching works is different on precise and trusty. [15:58:30] pmtpa made a lot less use of nfs generally [15:58:46] and the use of tool labs has grown radically in the last 2.5 years [15:59:46] and recollections from years ago are not highly trustable ;) [16:00:01] yeah we need to be pretty cautious about changing nfs options as it's going to cause issues somewhere no matter what we do [16:00:03] https://phabricator.wikimedia.org/T106170#1649619 <= (c) I am unable to reproduce those symptoms on Precise. --coren [16:00:10] next someone will suggest we switch back to gluster [16:00:31] gluster? [16:00:34] What was wrong with that? [16:00:38] I /think/ lookupcache can be changed w/ remount which means puppet change and not a reboot [16:00:43] but even that needs exploring [16:00:47] zhuyifei1999_, http://www.gluster.org/ [16:00:54] with gluster? it completely melted under load. [16:01:10] NFS does too. [16:01:10] and we switched everything to nfs [16:01:25] I've experienced at least 2 nfs outages due to load.
getting rid of SGE and its need for NFS is our best bet for a happy future [16:02:02] well, our k8s model still needs nfs as well [16:02:06] *all* distributed file systems suck [16:02:18] it's a bad solution to a hard problem [16:02:22] (s) [16:02:23] tom29739: nfs outage = countless [16:02:40] bd808: including openstack swift? [16:02:57] swift is not a shared file system, it's object storage [16:02:59] totally different [16:03:16] fine [16:03:28] Hi, I can't access glamtools, is it down? [16:04:22] glamtools loading forever here [16:04:59] bd808 chasemp getting rid of gridengine also gets rid of trusty so we can definitely kill lookupcache=none [16:05:16] Happy days :) [16:05:24] yuvipanda: why do we not need it on jessie? [16:05:34] Jessie doesn't have the issue. [16:05:41] why is that? [16:05:44] hmm [16:05:46] zhuyifei1999_: same [16:06:01] Precise doesn't either. [16:06:05] No idea who or what would be responsible for finding out why [16:06:07] k8s fires up new containers all the time? [16:06:17] chasemp: ^ [16:06:28] !log tools.glamtools Restarted webservice [16:06:34] I actually don't know [16:06:38] zhuyifei1999_ yup it does [16:07:08] I vaguely recall the task where coren set lookupcache to help w/ consistency across clients I think [16:07:13] or set it to none even [16:07:21] That's the idea of that option. [16:07:22] but I have no idea of how that plays out on jessie w/ containers honestly [16:07:29] chasemp https://phabricator.wikimedia.org/T106170 [16:07:40] containers don't come into the picture - they know nothing of NFS [16:07:48] they just mount from the host, so we only have to care about jessie [16:07:54] on jessie the same test takes 9s vs 16s on trusty [16:07:57] with same settings [16:08:23] right but number of clients and race conditions w/ accessing comes into play doesn't it?
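[Editor's note] A hypothetical pod-spec fragment for what "they just mount from the host" means: the NFS client runs on the node, and the container simply bind-mounts the already-mounted path. Names and paths here are illustrative, not the actual Tool Labs Kubernetes config:

```yaml
# Illustrative only: container sees the node's NFS mount via hostPath,
# so NFS mount options are decided once per node (jessie), not per pod.
spec:
  volumes:
    - name: data-project
      hostPath:
        path: /data/project    # the node's NFS mount
  containers:
    - name: webservice
      volumeMounts:
        - name: data-project
          mountPath: /data/project
```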
[16:08:33] I need to read the tasks and back reasoning [16:08:37] I honestly just don't know [16:08:58] yuvipanda: do you mean the container mounts the nfs mounts on the host? [16:09:01] yeah, the jessie nodes are fairly non-overloaded, but nobody's racing for this set of files tho [16:09:02] replace number of clients w/ concurrently accessing procs [16:09:14] zhuyifei1999_ yup. we just mount the host's /data/project onto the container [16:09:22] I think it remains to be seen what effect transitioning will have but [16:09:27] lookupcache=none also puts load on the server [16:09:33] so finding a way to not have it off would be nice [16:09:39] yeah [16:09:49] wouldn't the kernel aggressively cache the attributes since it's the same kernel? [16:10:23] in the current case? it would except we turn it off with lookupcache=none [16:10:39] I mean on k8s [16:10:51] jessie stuff [16:11:03] on k8s it'll be the exact same situation we have now wrt nfs. nothing new, except we'll be using a newer kernel with more bug fixes (jessie) [16:11:12] so whatever bugs people ran into on the trusty kernel are probably gone [16:11:20] k [16:11:22] although I'm kinda boggled we didn't try just updating the kernel first [16:11:30] 90% of nfs issues are client side outside of straight overload [16:11:38] which we have at least under a modicum of control atm [16:12:01] KaisaL: restarting the web service for glamtools didn't seem to make any difference. sorry. :?
[16:12:37] KaisaL: http://tools.wmflabs.org/?tool=glamtools <= contack magnus [16:12:40] *contact [16:36:52] zhuyifei1999_: https://commons.wikimedia.org/wiki/User_talk:Magnus_Manske#Glamtools_is_Down [16:36:55] Have asked him [17:05:57] re: nfs: the reason lookupcache was turned off was because the static file server borked every now and then, and this was fixed by a ls [17:06:22] I think we can safely turn it on for bastions -- that's also where the performance hit is the most relevant (from a user perspective) [17:06:42] because on a bastion 'run ls' is a reasonable response to 'I don't see my file' [17:07:00] right [17:07:05] so possible solution to [17:07:13] this is to make that option configurable [17:08:55] valhallasw`cloud: what about the generic (i.e. non-lighttpd) webservice modes? [17:09:05] zhuyifei1999_: what about them? [17:09:21] my pain point isn't pip -V too slow, but webservice restart too slow [17:10:13] due to the python imports [17:10:50] the solution to that would be k8s. I'm pretty sure 'slow restart' is preferrable over 'somehow python imports fail' [17:12:06] ( https://phabricator.wikimedia.org/T106170 ) [17:14:39] * zhuyifei1999_ strace python -c 'import flask' 2>&1 > /dev/null | less [17:15:13] 06Labs, 10Tool-Labs, 10DBA, 06Discovery, 06Maps: Some users use persistent connections that are idle, wasting memory and other resources that could be used for other users - https://phabricator.wikimedia.org/T138283#2412435 (10PeterBowman) Hi, @jcrespo, thank you for your quick response. I'm just startin... [17:16:16] https://www.irccloud.com/pastebin/RAEh37dz/ [17:16:31] it stat() the dirs before reading the files [17:19:39] zhuyifei1999_: as I read the task, it's about stat'in /files/. But even if it isn't, apparently ruby /is/ affected [17:21:02] yes, but wouldn't stat()ing the dir refresh the stat() of files? (I might be wrong here) and /me reads about ruby [17:21:29] oh ruby... 
[17:22:02] fine then [17:23:00] a sad tale about how horrible ruby and stat() can be -- http://bd808.com/blog/2014/09/30/puppet-file-recurse-pitfall/ [17:23:25] 2,293,677 during one puppet run [17:24:05] bd808: sudo puppet agent -tv was super slow for me. I straced it, and apparently it was reading my ~/.ruby/whatever over NFS >_< [17:24:35] yup. and puppet does pretty much non caching of the found files [17:24:48] yup, I learnt the -i trick from valhallasw`cloud I think [17:26:21] for every class in the graph it will check all of the ruby path locations to see if there is a pure ruby implementation of the resource. It repeats this for each and every resource in the graph even if you have say 135k resources of the same type [17:27:01] >_< [17:27:38] valhallasw`cloud: sudo -H should workaround that [17:27:41] https://tickets.puppetlabs.com/browse/PUP-2924 [17:31:13] bd808: https://phabricator.wikimedia.org/T137736 [17:31:20] want a fresh ticket? [17:31:47] also, maybe time to replace it with a fresh/more-RAM instance? [17:32:43] Labs is pretty tight on space for instances right now. Hopefully that will be fixed in the next week or so when new hardware comes online [17:35:13] hmmm... "Error: You are not allowed to start instance: phlogiston-1" [17:36:08] want me to kick it from nova? [17:36:16] yuvipanda: yes please [17:36:59] "internal error: No PCI buses available" sounds spooky [17:37:09] bd808: Actually I have one not needed medium instance. Should I delete it? ;) [17:37:18] bd808 yup, no luck [17:37:37] Luke081515: yes. every little bit helps [17:37:42] bd808 https://phabricator.wikimedia.org/T137857 [17:38:03] done [17:38:09] thanks Luke081515 [17:40:14] 06Labs: Labs instances failing with "internal error: No PCI buses available" - https://phabricator.wikimedia.org/T137857#2381540 (10bd808) The phlogiston-1 instance has managed to reproduce the issue a second time after being moved from labvirt1001 to labvirt1009. 
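[Editor's note] The stat()-on-directories pattern seen in the flask strace above comes from CPython's import machinery: the path-based finder caches each sys.path directory's listing and uses the directory's mtime (one stat per directory) to decide whether the cached listing is still valid, instead of probing every candidate file. That is also why stale dentry caches can "hide" new files until revalidation. A small runnable sketch:

```python
# Demonstrates importlib's directory-listing cache: a module created after
# the directory was first scanned is only found once the caches are dropped
# (or the directory mtime changes). Module name "demo_mod" is made up.
import importlib
import os
import shutil
import sys
import tempfile

tmp = tempfile.mkdtemp()
sys.path.insert(0, tmp)
try:
    importlib.import_module("demo_mod")   # not there yet; listing gets cached
except ImportError:
    pass
with open(os.path.join(tmp, "demo_mod.py"), "w") as f:
    f.write("VALUE = 42\n")
importlib.invalidate_caches()              # drop cached directory listings
mod = importlib.import_module("demo_mod")
print(mod.VALUE)  # 42
shutil.rmtree(tmp)
```

On NFS with lookupcache=none the kernel never trusts these lookups either, so every one of them becomes a server round trip.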
[17:41:07] jaufrecht: didn't we setup the phlogiston-2 instance so you could get rid of phlogiston-1? [17:41:15] * bd808 can't remember [17:41:45] bd808: I need both: one production, one dev. [17:43:50] related thoughts: 1) It's not hard to re-install phlogiston, so that shouldn't be a barrier to new/better server. 2) on phlog-2, it maxes out CPU while only using a quarter of the RAM; could probably benefit from DBA attention to optimize Postgresql, if such is available. 3) TPG tried to decide this quarter how formally to treat Phlogiston and what level [17:43:50] of support to ask for, but couldn't make the decision yet [17:48:09] you can certainly try to bring up a new host and hope it doesn't hit the same bug. Like I mentioned we are having some space issues at the moment so the smaller the better [17:48:37] I don't know any Postgres DBAs to point you to for help [17:49:37] TPG can ask for support, but I'm not sure where its going to come from. There aren't any WMF staff dedicated to running services in Labs for other teams AFAIK [17:50:16] 06Labs, 10Phlogiston: Phlogiston-2 failing with out of disk space error - https://phabricator.wikimedia.org/T138867#2412515 (10JAufrecht) [17:50:24] 06Labs, 10Phlogiston: Phlogiston-2 failing with out of disk space error - https://phabricator.wikimedia.org/T138867#2412531 (10JAufrecht) p:05Triage>03Unbreak! [17:50:28] moving it to production would be possible, but will take a bit of work (Puppet for everything and security review) [17:50:45] community tech! [17:50:53] * bd808 trouts yuvipanda [17:51:24] I really can't even imagine how commtech and burn down charts coincide [17:52:08] I was just asking for a trouting, mostly [17:52:34] don't make me show you the scroll ;) [17:53:52] :D [17:57:19] 06Labs, 10Phlogiston: Phlogiston-2 failing with out of disk space error - https://phabricator.wikimedia.org/T138867#2412538 (10JAufrecht) Forgot to turn off Postgresql full sql statement logging, leading to 4.5Gb of log files. 
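[Editor's note] The phlogiston disk-full fix above is a one-line postgresql.conf change; a sketch per the PostgreSQL documentation:

```
# postgresql.conf -- log_statement = 'all' records every SQL statement and
# can fill a small disk quickly (4.5 GB of logs in this case).
log_statement = 'none'    # quietest; 'ddl' or 'mod' are middle grounds
```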
[17:58:12] 06Labs, 10Phlogiston: Phlogiston-2 failing with out of disk space error - https://phabricator.wikimedia.org/T138867#2412539 (10JAufrecht) 05Open>03Resolved [18:13:22] (03PS1) 10Luke081515: Implement hook "remove" [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296425 [18:13:50] (03PS2) 10Luke081515: Implement hook "remove" [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296425 [18:15:00] (03PS3) 10Luke081515: Implement hook "remove" [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296425 [18:17:27] (03PS4) 10Luke081515: Implement hook "remove" [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296425 [18:24:16] (03PS1) 10Luke081515: Adjust url[en/de]code tools [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296427 [18:26:18] (03CR) 10Luke081515: [C: 04-1] Adjust url[en/de]code tools [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296427 (owner: 10Luke081515) [18:26:52] (03PS2) 10Luke081515: Adjust url[en/de]code tools [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296427 [18:29:00] (03Abandoned) 10Luke081515: Adjust url[en/de]code tools [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296427 (owner: 10Luke081515) [18:32:58] (03CR) 10Luke081515: [C: 031 V: 031] Implement hook "remove" [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296425 (owner: 10Luke081515) [20:06:50] 10Quarry: Queries running for more than 4 hours and not killed - https://phabricator.wikimedia.org/T137517#2412801 (10Dvorapa) Look on the screenshot. There are queries marked as running under queries marked as completed. This looks broken at least [20:08:51] yuvipanda hi, could you have a look at phab-01 since it failed to reboot. 
[20:08:52] please [20:10:42] 06Labs, 10Phabricator: Upgrade phab-01.wmflabs.org - https://phabricator.wikimedia.org/T127617#2412816 (10Paladox) [20:10:46] 06Labs, 10Phabricator: https://phab-01.wmflabs.org returns a core exception - https://phabricator.wikimedia.org/T137270#2412814 (10Paladox) 05Resolved>03Open Re opening this. I tried ssh into it but failed I looked at the website and it showed http 500 error so I rebooted it but now it failed to reboot.... [20:54:18] (03CR) 10Luke081515: [C: 032 V: 032] Implement hook "remove" [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296425 (owner: 10Luke081515) [20:59:42] (03Merged) 10jenkins-bot: Implement hook "remove" [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296425 (owner: 10Luke081515) [21:18:09] 06Labs, 10Phabricator: Applying role role::phabricator::main causes errors on instances - https://phabricator.wikimedia.org/T138881#2412999 (10Paladox) [21:29:13] (03PS1) 10Lokal Profil: Recreate source table prior to run [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/296466 (https://phabricator.wikimedia.org/T138606) [21:38:45] (03PS1) 10Lokal Profil: Add .gz to .gitignore [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/296468 [21:39:30] (03PS2) 10Lokal Profil: Recreate source table prior to run [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/296466 (https://phabricator.wikimedia.org/T138606) [21:50:49] (03PS1) 10Luke081515: Implement hooks for logging [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296470 [21:51:06] (03CR) 10jenkins-bot: [V: 04-1] Implement hooks for logging [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296470 (owner: 10Luke081515) [21:51:16] grmpf [21:52:48] (03PS2) 10Luke081515: Implement hooks for logging [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296470 [21:55:54] yuvipanda: Can I safely delete the docker nagf branch? 
[21:56:06] krinkle yes
[21:56:24] Thx
[21:56:38] [13nagf] 15Krinkle 04deleted 06docker at 14ab73221: 02https://github.com/wikimedia/nagf/commit/ab73221
[21:57:20] yw. thanks for patience, Krinkle
[21:58:36] yuvipanda
[21:58:41] it seems we can't
[21:58:46] ssh into some phab-instances
[21:58:55] I'll take a look in a few minutes
[21:59:01] such as phab-01 and phab-04 due to hardware failures
[21:59:14] yuvipanda: btw, does the default tool labs k8s config include health checks?
[21:59:14] and creating new instances still stops us ssh'ing into them
[21:59:22] for webservice that is
[21:59:37] krinkle http health checks? nope but I've a local patch adding it
[21:59:43] basically adds mod_status to lighttpd by default
[21:59:47] and then uses that for health check
[21:59:48] I recall you added it manually for Nagf in the past
[21:59:54] since /$toolname might be expensive
[21:59:55] which I assume is now effectively lost.
[22:00:10] yuvipanda: Yeah, makes sense.
[22:00:28] I might allow overrides too
[22:00:34] Though I'd like to have the option to make it check /$tool instead.
[22:00:41] To better catch problems.
[22:01:03] It depends though, it'll most likely catch bad deployments, which would affect it regardless of restarting it.
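[Editor's note: the health-check approach discussed above (lighttpd serving a cheap mod_status page, with Kubernetes polling it instead of the potentially expensive /$toolname route) could be sketched roughly as below. This is a minimal illustration, not the actual Tool Labs configuration; the endpoint path, port, and timing values are assumptions.]

```yaml
# Hypothetical liveness probe for a Tool Labs-style webservice pod.
# Assumes lighttpd has mod_status enabled, exposing a lightweight
# /server-status page that is cheap to serve compared to /$toolname.
livenessProbe:
  httpGet:
    path: /server-status   # mod_status endpoint (assumed path)
    port: 8000             # assumed container port
  initialDelaySeconds: 10  # give lighttpd time to start
  periodSeconds: 30        # poll every 30s
  failureThreshold: 3      # restart the pod after 3 consecutive failures
```

[An override option, as mentioned above, would let a tool point `path` at /$tool instead, trading probe cost for catching application-level breakage such as bad deployments or corrupted caches.]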
[22:01:23] but if cache corrupts, a new pod would fix that
[22:02:22] right
[22:02:32] but I think the mod_status is a good start
[22:02:42] and then I can maybe make it an option for webservice
[22:03:36] 10Tool-Labs-tools-stewardbots, 06Stewards-and-global-tools: Bring back requests.php - https://phabricator.wikimedia.org/T130028#2413083 (10MarcoAurelio) 05Open>03stalled p:05Normal>03Low
[22:04:09] (03Abandoned) 10MarcoAurelio: Bring back requests.php [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/280695 (https://phabricator.wikimedia.org/T130028) (owner: 10MarcoAurelio)
[22:04:23] 10Labs-Other-Projects: bugs.wmflabs: bad gateway bugzilla.wmflabs: can't connect to db - https://phabricator.wikimedia.org/T138883#2413101 (10Dzahn)
[22:04:36] 10Labs-Other-Projects, 10Wikimedia-Bugzilla: bugs.wmflabs: bad gateway bugzilla.wmflabs: can't connect to db - https://phabricator.wikimedia.org/T138883#2413113 (10Dzahn)
[22:07:26] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/8e1e032a556d4192bab58d7a9c61dc1c687fed29
[22:07:26] 13nagf/06master 148e1e032 15Timo Tijhof: graphs: Hide empty time series in wildcard expansion...
[22:08:26] yuvipanda we tried rebooting, i get
[22:08:26] Message: Unavailable console type serial.
[22:08:45] Message: internal error: No PCI buses available
[22:09:02] phab-01 has been busted for months
[22:09:37] bd808, i suggested moving it to a medium instance but it is happening to all instances
[22:09:42] on the phab project if rebooted
[22:10:19] that no pci buses error is worth adding to T137857
[22:10:19] T137857: Labs instances failing with "internal error: No PCI buses available" - https://phabricator.wikimedia.org/T137857
[22:10:27] [13nagf] 15Krinkle closed pull request #11: Test travis on php 5.5 and 5.6 and 5.7 (06master...06patch-2) 02https://github.com/wikimedia/nagf/pull/11
[22:11:00] bd808 ok
[22:11:06] phab-03 worked the last time I needed to test some stuff
[22:11:27] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/082c663d04eaf003f2a1d78cc3f0949c655314d5
[22:11:27] 13nagf/06master 14082c663 15Timo Tijhof: build: Add hhvm, php 5.5, and php 5.6 to Travis
[22:11:27] bd808, yep it does but if i restart it, it will fail
[22:12:17] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/7b14f957914f8b298f814d2f5e08b615a6ee9ce3
[22:12:17] 13nagf/06master 147b14f95 15Timo Tijhof: build: Enable container optimisation for Travis...
[22:18:45] (03CR) 10Luke081515: [C: 04-2 V: 04-1] "Works not. I will do this later." [labs/tools/Luke081515IRCBot] - 10https://gerrit.wikimedia.org/r/296470 (owner: 10Luke081515)
[22:21:32] wikimedia/nagf#46 (master - 082c663: Timo Tijhof) The build was broken. - https://travis-ci.org/wikimedia/nagf/builds/140936096
[22:23:06] [13nagf] 15Krinkle closed pull request #10: Update jakub-onderka/php-parallel-lint to 0.9.* and squizlabs/php_cod… (06master...06patch-1) 02https://github.com/wikimedia/nagf/pull/10
[22:24:51] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/1319b0bb0d4f568537967ee7a157921592e791bc
[22:24:52] 13nagf/06master 141319b0b 15Timo Tijhof: build: Upgrade php-parallel-lint to 0.9.x...
[22:26:23] wikimedia/nagf#47 (master - 7b14f95: Timo Tijhof) The build was broken. - https://travis-ci.org/wikimedia/nagf/builds/140936265
[22:29:32] wikimedia/nagf#48 (master - 1319b0b: Timo Tijhof) The build was fixed. - https://travis-ci.org/wikimedia/nagf/builds/140939329