[01:43:24] 3Wikimedia Labs / 3wikitech-interface: Thumbnails for Commons images are not displayed - 10https://bugzilla.wikimedia.org/72245 (10Tim Landscheidt) 3NEW p:3Unprio s:3normal a:3None On https://wikitech.wikimedia.org/w/index.php?title=User_talk:Automatik&oldid=131482, the thumbnail isn't displayed (htt... [02:33:53] PROBLEM - ToolLabs: Low disk space on / on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-01.diskspace.root.byte_avail.value (10.00%) [06:10:24] 3Wikimedia Labs: wildcard release parameter fails with jsub - 10https://bugzilla.wikimedia.org/72248 (10merl) 3NEW p:3Unprio s:3major a:3None jobs with wildcard release resource parameters are not interpret correctly if i am using jsub instead of qsub My testfile foo.sh: #$ -l release='*' #$ -w v echo... [07:47:30] !log deployment-prep Converted I-000006ad.eqiad.wmflabs to use local puppet & salt masters [07:47:43] not really :) [08:34:49] no morebots? [08:39:25] hashar: can you check if I-000006ad.eqiad.wmflabs is okay for local puppet & salt masters? It doesn't get patch I cherry pick at deployment-salt [08:39:38] if possible, let me know how to debug :) [08:40:35] kart_: morning :D [08:41:12] hashar: Good Morning! [08:41:19] kart_: in wikitech when looking at the list of instance at https://wikitech.wikimedia.org/wiki/Special:NovaInstance there is a "puppet status" column [08:41:42] if the instance has a puppet status of "OK" it stills point to the labs puppetmaster instead of the beta cluster puppet master [08:42:04] Noted. [08:42:22] does puppet pass on deployment-apertium02 ? [08:42:55] when ever you save the configuration of an instance, it is saved (in LDAP I think) instantly [08:43:06] so the next run of puppet should reflect the changes made in the web interface [08:43:31] another thing is that both puppet and salt uses certificates between the client and servers [08:43:46] hashar: it passes, but definately not from deployment-salt. [08:43:50] so when switching an instance to use the beta cluster puppet/salt masters, you have to manually confirm the client keys [08:44:01] hehe [08:44:12] hashar: yes. I did as mentioned on wikitech page. [08:44:17] great [08:44:21] (did that fine with cxserver03) [08:44:23] so puppet conf is in /etc/puppet/puppet.conf [08:44:33] and it shows agent.server = deployment-salt.eqiad.wmflabs [08:44:36] so that seems fine [08:45:07] yes [08:45:24] what is the patch you cherry picked on the puppet master ? [08:45:35] https://gerrit.wikimedia.org/r/#/c/165485/ [08:45:49] I expect something wrong with it now :) [08:45:56] have you applied role::apertium::beta on the apertium02 instance ? [08:46:39] another thing, it doesn't appear in config! [08:46:43] hehe [08:46:56] so, obviously, patch has issue. [08:46:58] so if the role class is not applied on the instance, it is never going to be realized by puppet :D [08:47:20] right. [08:47:44] in OpenStack manager, you would have to add the puppet class to the list of classes that can be applied on an instance [08:47:46] that is done at https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup [08:48:04] under 'deployment-prep' section there is: Classes [Add class] [08:48:37] that would make "role::apertium::beta" available in the list of classes to apply on an instance [08:48:46] then configure the instance and check that newly available role::apertium::beta class [08:48:48] run puppet [08:48:50] solved :D [08:48:51] (hopefully) [08:50:11] :) [08:50:16] Thanks. Testing. 
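For reference, the manual key confirmation hashar describes when repointing an instance at the deployment-prep puppet/salt master comes down to a handful of commands. This is only a rough sketch assuming the stock puppet and salt CLIs of the time; the certificate/minion name is normally the instance FQDN (i-000006ad.eqiad.wmflabs below is just the example instance from the conversation).

    # on the instance: confirm it now points at the project-local master
    grep server /etc/puppet/puppet.conf        # should show deployment-salt.eqiad.wmflabs
    sudo puppet agent --test                   # first run sends a new certificate request

    # on deployment-salt: confirm the client keys by hand
    sudo puppet cert list                      # pending certificate requests
    sudo puppet cert sign i-000006ad.eqiad.wmflabs
    sudo salt-key -L                           # list unaccepted minion keys
    sudo salt-key -a i-000006ad.eqiad.wmflabs

    # back on the instance: the next agent run should now pick up cherry-picked patches
    sudo puppet agent --test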
[08:53:27] hashar: error on puppet run, I guess I will wait from Alex for confirming all packages availability. [08:53:46] hashar: Thanks for helping as usual ;) [08:53:57] (NovaPuppetGroup was new learning!) [08:54:29] hashar: if you're bored, feel free to review Apertium patch ;) [08:54:46] !ping [08:54:46] !pong [08:54:49] ok [08:58:05] kart_: make sure Alexandros is aware of the packaging need s:D [09:02:52] he is :) [09:15:15] !log tools cleaned out .svg files from /tmp on tools-webgrid-01 [09:24:48] RECOVERY - ToolLabs: Low disk space on / on labmon1001 is OK: OK: All targets OK [10:19:40] PROBLEM - ToolLabs: Low disk space on / on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-01.diskspace.root.byte_avail.value (10.00%) [11:42:56] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (10.00%) [12:01:04] RECOVERY - ToolLabs: Low disk space on / on labmon1001 is OK: OK: All targets OK [12:17:31] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [12:45:10] 3Wikimedia Labs / 3tools: program created by proprietary compiler allowed on labs? - 10https://bugzilla.wikimedia.org/72253 (10Inkowik) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier Hi, I am using a self-written framework (in https://en.wikipedia.org/wiki/PureBasic) for my bot running on my local mac... [12:47:07] 3Wikimedia Labs / 3tools: program created by proprietary compiler allowed on labs? - 10https://bugzilla.wikimedia.org/72253#c1 (10Yuvi Panda) I think it would be, as long as your code itself is open source. I'll wait for Coren to confirm, however. [13:04:40] 3Wikimedia Labs / 3tools: program created by proprietary compiler allowed on labs? - 10https://bugzilla.wikimedia.org/72253#c2 (10Marc A. Pelletier) Legal was pointed towards this bug, and their opinion on the matter is what we need so I am not going to express an opinion either way. :-) [15:07:24] andrewbogott: hello, lvs question for labs ? :) [15:07:32] matanya: what's up? [15:08:01] modules/lvs/manifests/configuration.pp refers quite a lot to pmtpa for labs [15:08:33] i guess it is moot by now, should i remove the entire ref to tamp/replace with eqiad/don't touch ? [15:09:35] matanya: That'd be fine, although we'll need to keep the switching logic in place for codfw. [15:09:52] Well… ideally it'll move change to using heira but having the switch there is a good reminder that it needs to switch [15:10:09] andrewbogott: that = what opetion ? one ? [15:10:45] sorry, I don't understand the question [15:11:03] I gave three options with what to do with that config [15:11:42] Oh… hm. [15:11:48] I guess don't touch, since that's easiest :) [15:12:11] I'm puzzled that the eqiad option for a lot of these settings is 'undef'. Does that mean that a lot of that config isn't used at all anymore? [15:14:16] andrewbogott: can you create a gerrit project for me when you have the time? operations/software/shinkengen [15:14:47] YuviPanda: with an empty commit, or are you going to import? [15:14:56] I'm going to import [15:15:40] ok. And… 'rights inherited from'? [15:16:18] andrewbogott: i'll leave it for now. thanks! [15:17:00] YuviPanda: rights? [15:17:09] andrewbogott: ah, hmm. I'd say operations/puppet [15:17:14] andrewbogott: but I can't merge then :| [15:17:26] yeah, I almost did that and then though, wait, that's not so good for yuvi [15:17:32] andrewbogott: can we make it ops/puppet but also add me? [15:17:38] Yeah, i think so [15:17:47] cool, can you do that? 
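As a side note on the import attempted in the next exchange: pushing an existing history into a freshly created Gerrit project such as operations/software/shinkengen is normally done against the ssh remote on port 29418. A minimal sketch, with <user> standing in for the Gerrit account name; a direct push to refs/heads/master needs the Push permission being sorted out here, while refs/for/master would create an ordinary review instead.

    # in the existing local shinkengen checkout
    git remote add gerrit ssh://<user>@gerrit.wikimedia.org:29418/operations/software/shinkengen.git
    # direct import of the existing history (requires Push on refs/heads/*)
    git push gerrit HEAD:refs/heads/master
    # or submit the same commits for normal review instead
    git push gerrit HEAD:refs/for/master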
[15:17:58] two more weeks! [15:18:35] Oh, dammit, except that's done by group... [15:18:57] well, just add mediawiki/core as well? we can remove it later... [15:19:54] ok… mediawiki has push and submit. I think that's what you need to import... [15:19:59] ok.. [15:20:01] let me tr [15:20:01] y [15:21:52] andrewbogott: ! [remote rejected] master -> master (prohibited by Gerrit) :( [15:22:23] try again? [15:22:35] andrewbogott: same thing [15:23:03] you're in group 'mediawiki' right? [15:23:32] andrewbogott: yeah, I can merge there.. [15:23:36] so I should be? [15:23:42] Is there any other way to check? [15:25:04] ok, one more time? [15:25:19] doing [15:25:56] andrewbogott: yup, done! [15:25:57] thanks [15:26:07] cool. I'm going to remove some of these weirder permissions now [15:26:28] andrewbogott: cool [15:26:39] you should still have 'submit' [15:27:34] trying now [15:27:47] andrewbogott: nope, can't +2 [15:27:52] dammit [15:28:09] I guess +2 is different from submit? [15:28:13] !log updated salt master on virt1000 to 2014.1.11 (this is labs salt master) [15:28:15] yeah [15:28:16] probably [15:29:21] ok, I think I found +2 [15:29:57] yay [15:31:00] andrewbogott: so I can C+2, but can't V+2 or submit [15:31:18] Ah, not using jenkins I suppose [15:31:24] yeah [15:31:26] not yet [15:32:00] should be better now [15:32:19] andrewbogott: yay! thanks [15:46:31] andrewbogott: wanna merge https://gerrit.wikimedia.org/r/#/c/167591/? Tested, and affects shinken only [15:57:02] andrewbogott: what do you think bout the salt minions in the lab instances? should I mass upgrade them or not? it's not required [15:57:10] i.e. the master is backwards compat [15:57:36] apergos: Let me try on a test instance... [15:57:46] It'll be hard to do everywhere in any case, as salt is broken on many many instances [15:58:28] make sure it's running precise [15:58:42] I hven't pushed the lucid or trusty packages yet, that will be in another 20 mins or so [15:59:03] beta is running upgraded on all instances though [16:03:10] apergos: so, I think I don't care :) Probably fine to let it alone unless there are specific benefits we'll get from an upgrade. [16:03:31] bug fixes, oh... hmm... there's some grain related stuff [16:03:39] but if you'renot using trebuchet in there you won't care either [16:04:22] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (10.00%) [16:05:05] apergos: if you're feeling super ambitious… I have a report that detects which instances are salty and which aren't. So the full project would be to upgrade, and to investigate (and maybe fix) those that aren't responding. [16:05:20] eeerrrggghh [16:05:22] YuviPanda: Can you take care of ^^? You have root on labmon1001, right? [16:05:30] well I would be fine upgrading those that are responsive [16:05:34] andrewbogott: yeah, that's not on labmon, it's on tools-webproxy [16:05:44] I can do that by a test.ping and shovleling the names in there [16:05:47] YuviPanda: ? how so? [16:06:02] apergos: can't you just tell salt "upgrade everything" and what fails, fails? [16:06:13] well I could I guess [16:06:14] andrewbogott: it checks graphite, just is picked up by icinga from labmon. see the message - tools. 
is specified [16:06:41] YuviPanda: oh… that's confusing :) [16:06:43] but I'd rather just have failures from non responsive hosts to look at [16:06:48] andrewbogott: yeah, shinken should fix things up :) [16:06:51] instead of possibly fails from hosts that were working [16:07:01] and not know which ones were which [16:07:31] apergos: I think you should go ahead. If you notice any patterns (like, e.g. 'salt is broken on every instance created since August') then please let me know [16:07:40] yep [16:07:51] ok I'll get to that after I finsih up the prod cluster [16:08:04] then I'll have all the debs out there [16:08:07] apergos: you'll need to set up a trusty package first, lots of labs trusty instances. [16:08:14] ah, yeah, what you said :) [16:08:25] I stole these (just like ryan) from the salt ppa [16:13:05] BAM! http://shinken.wmflabs.org/host/tools-login [16:13:09] yay [16:13:45] andrewbogott: https://gerrit.wikimedia.org/r/#/c/167597/ and https://gerrit.wikimedia.org/r/#/c/167600/ [16:13:57] andrewbogott: already deployed, and things work! (see previous link) [16:14:54] YuviPanda: what is my login on shinken? [16:15:03] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [16:15:05] oh, nm :) [16:15:07] andrewbogott: guest/guest, we don't have working login yet [16:15:50] should there be a red line for /var on that page someplace? [16:16:05] andrewbogott: I don't have service checks yet. [16:16:10] ok [16:16:28] andrewbogott: shinkengen generates hosts.cfg now, will need to make it generate services [16:30:13] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (11.11%) [16:30:47] andrewbogott: ^ is flapping, since it's very close to the threshold when it rotates logs, I think. I'll fix in a while, shinken is close anywa [16:48:18] hmm, need to have shinken spam here [16:48:28] could either to irccecho or replace it properly... [16:48:29] hmm [16:53:33] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [17:15:28] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (10.00%) [17:16:28] PROBLEM - ToolLabs: Puppet freshness check on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-tomcat.puppetagent.time_since_last_run.value (100.00%) tools.tools-webgrid-04.puppetagent.time_since_last_run.value (100.00%) tools.tools-webgrid-03.puppetagent.time_since_last_run.value (100.00%) tools.tools-webgrid-01.puppetagent.time_since_last_run.value (100.00%) [17:16:39] Hmm. Odd. [17:17:54] Ah. Manifest error. Will be corrected when I'm ready for the next patch. [17:21:09] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [17:22:46] ACKNOWLEDGEMENT - ToolLabs: Puppet freshness check on labmon1001 is CRITICAL: CRITICAL: tools.tools-webgrid-tomcat.puppetagent.time_since_last_run.value (100.00%) tools.tools-webgrid-04.puppetagent.time_since_last_run.value (100.00%) tools.tools-webgrid-03.puppetagent.time_since_last_run.value (100.00%) tools.tools-webgrid-01.puppetagent.time_since_last_run.value (100.00%) Coren Known issue in the manifest will be cleaned up by an upcoming [17:35:58] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (10.00%) [17:38:14] Coren: https://gerrit.wikimedia.org/r/#/c/167623/ <- if this entirely blocks instance creation then I'll raise it back up a bit. 
But, something like this might prevent future crashes like virt1005 last week [17:38:55] Ah-ha. So /that/'s what happened to 1005? [17:39:17] maybe? [17:39:33] OOM seems like the most likely explanation. And that's the only tool I have to prevent such things :) [17:40:19] In other news, this looks so crazy that it almost has to be wrong: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=mem_free&s=by+name&c=Virtualization+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [17:40:23] well, related news [17:40:24] I always feel a bit uneasy with any overcomitment, personally, but we don't have infinite hardware and budget. [17:41:22] I dunno, it looks pretty consistent to me. [17:41:47] Coren: for one thing, why is virt1009 red and virt1007 green? [17:41:56] But also, really, only 10M free? That seems… alarming [17:48:25] Wait, I must be looking at the wrong thing because I see over 1T free over the set. [17:49:25] 150-ish on each. [17:49:50] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [17:53:28] Coren, if you scroll down and look at the individual graphs... [17:53:41] I bet there's a mistake, and some of the axis should be 'G' rather than 'M' [17:54:00] well, even that wouldn't explain the colors [17:54:32] Well, there's a bug indeed because clicking on any of them leads you to the per-server details and the numbers there are much more sane. [17:54:44] * YuviPanda shakes head in general direction of ganglia [17:57:09] Coren: for example? I just clicked through several and see the same numbers [17:57:21] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (10.00%) [17:57:50] https://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20eqiad&h=virt1002.eqiad.wmnet&m=mem_free&r=hour&s=by%20name&hc=4&mc=2 [17:58:00] Shows 168 some G of free [17:58:35] Ah, but the *bottom* graph shows the tiny number. [17:58:44] Ah, I'm looking at memory graphs, you're looking at disk space I think? [17:59:00] Oh, no, I see what you mean [17:59:24] wtf [17:59:27] It looks like it's calcularing free_memory differently than how it shows memory usage. Dumb dumb dumb. [18:02:21] In fact, that free_memory number seems to have no correlation to the actual numbers; free on virt1002 reports the same as the usage graph. [18:02:38] well… that's good news I guess [18:04:12] Bad metrics is bad news, but less bad than "omg no ram left!!one!!1" [18:05:05] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [18:05:11] is there an easy way (cli) to poll with a list of instanecs and ask 'has this been deleted'? [18:05:38] I can do it the slow way, by going to the nove page for the resource but meh [18:06:53] I've got 1900 of them to check so I really don't want to do it manually... 
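One low-tech way to answer apergos's question, anticipating the suggestions that follow: dump the current instance list once and compare it against the 1900 names. A rough sketch only, assuming the OpenStack credentials from novaenv.sh and a one-name-per-line input file (instances-to-check.txt is made up for the example):

    source novaenv.sh                          # OpenStack credentials, per the wikitech docs
    nova list --all-tenants > current-instances.txt
    while read name; do
        grep -qw "$name" current-instances.txt || echo "$name: not found (probably deleted)"
    done < instances-to-check.txt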
[18:08:18] apergos: this may help: https://wikitech.wikimedia.org/wiki/OpenStack#novastats.py [18:08:28] Will require a bit of python to customize, but it's pretty straightforward [18:09:31] apergos: andrewbogott I *kinda* did something similar yesterday [18:09:50] I was considering polling for entries in the archive table for wikitech [18:10:03] apergos: https://gerrit.wikimedia.org/r/#/c/166902/ has python code, used to archive deleted instance metrics from graphite [18:10:08] apergos: note that the initial stage of that, gathering up the big dict with all instance stats, takes forever to run [18:10:28] apergos: I added an API call to wikitech to list current instances, so you can call that and check if the instance you're looking for is there [18:11:03] oohhh get_deleted_instances(): [18:11:07] that looks like the ticket [18:11:19] it can take an hour, I wouldn't care [18:11:55] apergos: get_deleted_instances() is in yuvi's code, right? [18:12:08] uh huh [18:12:10] YuviPanda: might be nice to merge your query stuff with mine, I bet yours is better :) [18:17:57] apergos: andrewbogott note that get_deleted_instances looks at *graphite* as well, and works only in that context. You need some list of 'current instances I am tracking', and then you can have a 'current instances on wikitech' and compare to find deleted ones [18:18:12] apergos: andrewbogott the other code to fetch them might be more useful :) [18:18:24] hm I want to avoid a graphite dependence actually [18:18:24] There's also just 'nova list --all-tenants' [18:18:39] and then some sed [18:18:48] andrewbogott: I guess the wikitech API is just the replacement for nova list [18:18:50] I"ll have a look at that output, maybe that's quicker [18:18:57] YuviPanda: yeah :) [18:19:08] apergos: https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=deployment-prep&niregion=eqiad&format=json is the equivalent output from wikitech. [18:19:20] if I get this going it can be turned into a cron job and clean up once a week or so [18:19:38] andrewbogott: I still need to have it output: 1. list of images + ids (so I can map those), 2. list of puppet vars + classes per instance. Will dig into the PHP later [18:19:48] got it [18:20:02] I'm tempted to use wikitech but having a reliance only on two things (nova and puppet) is likely best [18:20:08] if I can do it quickly [18:20:24] apergos: nova requires some auth as well, I think. andrewbogott? [18:20:31] that's why I wrote the wikitech thing rather than depend on nova [18:20:38] 3Wikimedia Labs / 3Infrastructure: mariadb10 s2/s4/s5 unreachable - 10https://bugzilla.wikimedia.org/69144#c4 (10Sean Pringle) 5NEW>3RESO/FIX This hasn't recurred, plus the box has been since upgraded to MariaDB 10.0.14 which is listed as including the MDEV 6455 fix. Worth watching carefully, but no need... [18:21:20] YuviPanda: for a one-off on the commandline, https://wikitech.wikimedia.org/wiki/OpenStack#novaenv.sh [18:21:39] if apergos needs a recurring reliable process, then… I would have to think about that while not in a meeting :) [18:22:02] andrewbogott: pffft, meetings are best places to think :) although maybe I've been going to different kinds of meetings. [18:22:14] yeah I'm mostly listening also [18:23:14] * YuviPanda probably won't make them until Nov [18:27:47] ot puppet keys. salt keys. grrr [18:27:52] anyways... [18:29:25] how is the instance name derived from anything? 
it's going to be annoying if I have to run hostname on each of them via salt to do the mapping [18:32:29] apergos: you're asking about correlating the ec2 name with the human-readable name? [18:32:37] yep [18:32:53] nov seems to list the instance by hostname and a long id that is unrelated to the instance name [18:33:07] i-00000220 = dpeloyment-db1 r whatever [18:33:08] That's pretty much what I wrote that nova stats tool for. [18:33:13] i need to be able to get at that [18:33:21] There are actually three ids, the nova id, the ec2 id, the hostname. [18:33:32] nova id and ec2 are unique, hostname not necessarily unique. [18:33:42] I guess the i-xxx is the ec2 name? [18:33:45] yeah [18:34:04] guess I'd better look at nova-stats then :-D [18:34:11] Maybe [18:34:19] (wikitech api lists them too :) ) [18:37:10] andrewbogott: Oh, that reminds me - if we tweak the images to accept the new_install root key before its first puppet run, we can have the nova rescue mode working. I've been testing it when mls died and that's the only thing we need to have it work. [18:38:16] Coren: That's probably fine. A year or so ago we discussed the merits/demerits of including keys in new images. The disadvantage is it means we can pretty much never revoke access (since many many labs instances have broken puppet) [18:38:24] But I don't know if that's a necessary paranoia [18:38:43] Well, it only needs to exist until a succesful puppet run so the manifest can remove it. [18:39:10] oh? Then what is it good for? Maybe I don't understand what 'nova rescue' is [18:39:15] It doesn't need the key ongoingly, just to set up? [18:40:14] nova rescue creates a new empty image with just the OS, and puts the "real" one on the second virtual device. Because of keys, puppet can't run in the rescue image (and we wouldn't want it to) but if we can log into it we can then mount and fix the real one. [18:43:03] ohh Coren, virt1006: known dead? or ..? (it's the only host salt knows that doesn't respond in prod) [18:44:27] apergos: I know it's not in actual use, though andrewbogott probably can tell you why. [18:44:59] virt1006 was misbehaving -- we migrated everything off of it and shut it down and now I'm not sure what the right thing to do is. [18:45:14] But, yeah, not in use for the moment. [18:45:22] Oh, right, it's that one. [18:45:51] ah ha [18:45:58] The frustrating thing is that it's really hard to limit the scheduler to particular hosts. So I'd like to re-image it and keep it in a probationary state... [18:46:10] but that feature in nova either doesn't work or I don't understand it :( [18:46:11] well if it comes on line, with the same name and install, it will need a manual salt minion upgrade [18:46:20] if it's repurposed then there's no worries [18:46:27] apergos: if it comes online it'll be with a fresh OS install [18:46:32] excellent! [19:07:13] !ping [19:07:13] !pong [19:15:23] 3Wikimedia Labs: wildcard release parameter fails with jsub - 10https://bugzilla.wikimedia.org/72248 (10merl) [19:15:23] 3Tool Labs tools / 3[other]: merl tools (tracking) - 10https://bugzilla.wikimedia.org/67556 (10merl) [19:34:47] anyone know what deployment-soa-cache is for? is down [19:40:40] Coren: I just sent a message to labs-l with details of possible database corruption in enwiki_p [19:41:51] hm, heya, i created 3 new instances in the services project about an hour ago [19:42:01] and they are still on BUILD (scheduling) Instance state [19:48:23] andrewbogott: merge https://gerrit.wikimedia.org/r/#/c/167646/? 
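Going back to the instance-id mapping question from earlier in the afternoon: the wikitech API query linked above returns, per project, the instances with their identifiers, so hostname and i-xxxxxxxx id can be correlated without talking to nova directly. The field names in the response aren't shown in the log, so this sketch just pretty-prints the JSON and greps it (the i-00000220 / deployment-db1 pair is the example from the conversation):

    curl -s 'https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=deployment-prep&niregion=eqiad&format=json' \
        | python -m json.tool > deployment-prep-instances.json
    grep -i -C3 'i-00000220' deployment-prep-instances.json   # locate the entry and read off its hostname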
[19:48:35] I've all of toollabs and deployment-prep on shinken now :) [19:48:48] now to clean up the generator code a bit, make it into a package, and deploy that with cron... [19:49:03] hmm, not cron, I can just trigger it from puppet [19:49:38] andrewbogott: ty [19:53:44] is there a known issue about indices on enwiki_p? [19:54:01] i'm seeing queries taking forever that used to complete sub-second [19:54:05] like select rev_user, count(*) from revision_userindex where rev_timestamp between '20141001000000' and '20141002000000' group by rev_user; [20:06:06] Coren: have you seen https://bugzilla.wikimedia.org/show_bug.cgi?id=72226 [20:07:57] Betacommand: looks like that's the same issue I reported independently [20:08:15] russblau: yeah, emailed you about it [20:08:25] this really ought to be a "red alert" level bug! [20:08:37] which prompted me to nag Coren :P [20:09:01] Noted, ima poke Sean. [20:17:55] heya, andrewbogott, it looks like new labs instances aren't building [20:18:04] ottomata: ok, I will look [20:18:26] 3 in the services project aren't building, and i just double checked in the analytics project. same deal. [20:18:27] thanks [20:23:39] Betacommand: at least this is a better response than we used to get back in the good old Toolserver days :-) [20:25:46] russblau: Betacommand: Last I heard about this is that there are cases where OOM can cause some "skipped beats" in replication; I know Sean is working on a reliability fix there but I don't know how close he is to a solution. [20:30:56] Coren: ouch, this a a common issue then [20:31:33] Thankfully not that common, OOMs are rare and fairly well guarded against, but we expect there are little inconsistencies that crept in over time because of it. [20:34:00] ottomata: reluctantly fixed [20:34:15] "reluctantly"? [20:34:26] Oh. Overcommited? [20:34:50] Coren: does this require re-initializing the replica to fix it? [20:34:59] thank you andrewbogott, 2/3 active now [20:35:25] andrewbogott: I think we need to put "moar powar" higher on our priorities. I've been keeping an eye on CPU when it's memory I needed to be concerned about. [20:35:25] !log deployment-prep updated OCG to version ea10c93aca9bc1cae34f284fd74bb05d4b6a8cc6 [20:35:43] russblau: i have the same problem and opened a bugreport more than a month ago: https://bugzilla.wikimedia.org/show_bug.cgi?id=70711 [20:35:53] Coren: You mean buying new hardware? I'm waiting on a bid from HP and will order hardware as soon as they respond. [20:36:02] ah, should've cc'd you... [20:36:13] russblau: I don't know. I expect that if it turns out to be needed, we'll do the databases in rotation so it shouldn't be causing issues except for 2/3 of the available power. [20:36:39] andrewbogott: Ah. I knew you were looking into it, I didn't know we were that far along in the req process. [20:37:09] russblau: (It's easier now that the databases are functionally identical) [20:42:30] 3Wikimedia Labs / 3deployment-prep (beta): no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - 10https://bugzilla.wikimedia.org/72275 (10spage) 3NEW p:3Unprio s:3normal a:3None I'm getting 503 errors from beta labs when I visit http://en.wikipedia.beta.wmflabs... [20:42:53] Coren: why can sge resources defined in script head and not as argument be differently interpreted by qsub and jsub? [20:45:18] andrewbogott: sorry to bug ya but it looks like that 3rd instance is in the same state [20:45:31] also, nfs home not mounting? 
( i don't really care about that right now, just fyi) [20:45:46] ottomata: I'd say just trash the third instance and try again [20:46:37] ottomata: what project/instance is having trouble with nfs? [20:46:51] services, otto-cass1 and otto-cass2 [20:47:45] mount... failed, reason given by server: No such file or directory [20:48:04] that's just via puppet though [20:48:09] haven't actually checked if it is failed [20:48:12] could just be puppet erroring [20:48:19] how long has this project existed? [20:48:36] Have other instances in this project access nfs properly? [20:49:17] andrewbogott: don't know, this is gwicke [20:49:19] 's project [20:49:26] this is my first time creating instances there [20:49:34] don't worry about it, unless he complains, or I do later [20:49:40] i really don't need the shared NFS there [20:49:42] ok. That project has shared directories turned off. So, those volumes can't be found because they don't exist. [20:49:46] ah! [20:49:47] ok! [20:49:49] You can turn them on via the project config if you like. [20:50:03] naw, its ok, but we should probably make puppet ok with that if they are turned off [20:51:00] that turns out to be hard [20:51:06] and it's on by default in new projects. [20:51:23] ottomata: you should have full rights to change anything in the project [20:51:50] I don't need nfs normally, so didn't notice so far [20:52:14] Merlissimo: jsub does not support qsub-like script headers - they are considered as normal executable files. If you need that functionality, you'll need to use qsub itself. [20:52:18] yeah i don't need it either, its ok [20:52:22] i just noticed in the puppet output [20:53:54] 3Wikimedia Labs / 3deployment-prep (beta): no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - 10https://bugzilla.wikimedia.org/72275 (10Greg Grossmeier) p:5Unprio>3Normal a:3Sam Reed (reedy) [20:55:24] Coren: sge script header are respected by jsub in generell. it only overrides some like queue name [20:56:14] ... they're really not supposed to since they get invoked with '-b y'. I'll have to look into it. [21:56:06] <^d> Who can adjust quotas? [21:56:31] * ^d has some ideas, but doesn't want to if there's someone capable about :) [21:56:36] <^d> *want to ping [21:58:46] ^d: what do you need? [21:59:18] <^d> We filed https://bugzilla.wikimedia.org/show_bug.cgi?id=71886 for deployment-prep. 43 instances seems to be bit tight now. [22:00:21] I can raise the quota a bit. Most of this is stalled, though, pending new hardware purchases. [22:00:49] <^d> Yeah, understand. Just a little more wiggle room is what we're needing, not a ton of space for expansion :) [22:01:16] Your current quota is 45 -- you've hit that already as well, I take it? [22:01:47] <^d> Ah, we're at 43/45. [22:01:58] so -- wiggle room? [22:02:29] <^d> I can probably get my current task done :) [22:23:47] <^d> Did gmond disappear in beta or something? [22:23:59] <^d> puppet's all broken on deployment-elasticNN boxes :\ [22:25:24] cough.. the 'uptime' of puppet not being broken [22:25:42] i think i saw something about gmond .. wait [22:54:36] qdel is still not working properly for me ,and im having to go onto the xgrid and use kill [22:54:49] is there a bug open for this, should I file one?
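On the recurring jsub-versus-qsub thread (the wildcard release bug from the morning and Coren's "-b y" explanation above): if jsub really hands the script to qsub as a binary, the embedded "#$" directives in something like the morning's foo.sh would not be parsed. A hedged workaround sketch using only plain gridengine commands, as Coren suggests; the same requests can also be given as explicit qsub flags:

    # foo.sh carries its requests as SGE header comments:
    #   #$ -l release='*'
    #   #$ -w v
    qsub foo.sh                                # qsub parses the #$ headers itself
    qsub -l release='*' -w v foo.sh            # same thing with explicit command-line flags
    qstat -u "$USER"                           # check the job landed where expected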