[00:01:12] ShiveringPanda: Oh yeah, it seems to be working, but is lagging. So nevermind :) [00:01:26] kaldari: :D ok [00:01:31] kaldari: some day we'll have real log collection [00:02:02] someday... [00:02:22] :D [00:02:30] bd808: you should move stashbot to tools [00:02:55] I'm close, except blocked on not being able to log into the elastic hosts [00:03:30] I have this mostly untested bot to replace the logstash part of the pipeline -- https://github.com/bd808/tools-stashbot [00:04:45] bd808: oh, I thought we fixed it? [00:04:51] bd808: by unchecking roles from wikitech? [00:05:20] * bd808 doubles checks that [00:06:36] Didn't the instance pages used to link to the config bits too? -- https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools-elastic-01.tools.eqiad.wmflabs [00:06:52] no idea, I've never used them :D [00:07:22] 6Labs, 10MediaWiki-extensions-OpenStackManager: Permissions error on creating Hiera pages on wikitech - https://phabricator.wikimedia.org/T121602#1883432 (10madhuvishy) @Krenair No. This is what happens: Permission error You do not have permission to create pages, for the following reason: Only global clouda... [00:07:50] ShiveringPanda: role::toollabs::puppet::client is no longer checked but my key is still not accepted [00:08:06] maybe I should just nuke the instances and start over? [00:08:12] bd808: nah let me look [00:08:45] 6Labs, 10MediaWiki-extensions-OpenStackManager: Permissions error on creating Hiera pages on wikitech - https://phabricator.wikimedia.org/T121602#1883440 (10Krenair) Yeah, same problem for me on the deployment-bastion page. I managed it at one point, might've been my cloudadmin right around that time though: h... [00:10:14] bd808: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class ::role::toollabs::puppet::client for tools-elastic-01.tools.eqiad.wmflabs on node tools-elastic-01.tools.eqiad.wmflabs [00:10:16] hmm [00:10:55] bd808: I can confirm they aren't in ldap [00:10:57] oh wait [00:13:40] bd808: hmm [00:13:44] 'failed publickey' [00:13:54] that's ldap I bet [00:13:58] oh yeah [00:14:03] maybe pointing to old stuff [00:14:06] puppet is running [00:14:07] yeah [00:14:08] or not [00:14:23] my hot shower just came back after not working for 2d, so I'm going to go use it right now. [00:14:30] by the time I'm back puppet should've run :D [00:14:32] brb [00:14:32] do it! [00:14:43] * bd808 wanders off for a bit too [00:15:22] 6Labs, 10MediaWiki-extensions-OpenStackManager: Permissions error on creating Hiera pages on wikitech - https://phabricator.wikimedia.org/T121602#1883457 (10Krenair) a:3Krenair [00:37:24] !precise is tools-precise-dev, not tools-dev-precise [00:37:24] Key was added [02:17:01] YuviPanda: Puppet did its magic and I can get into the elastic hosts again. ty [08:28:45] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:33:18] ^ is NFS server hardware issue [08:33:35] I see Labs is down [08:33:36] yes [08:33:36] we're working on it [08:33:37] hardware issues [08:33:37] Cool. [08:33:38] Thanks for working on it! [08:33:38] * Hydronium hands out tea [08:33:49] sorry to pester, but is there somewhere I can go to check on the status of the NFS outage? phab report? [08:33:50] MusikAnimal: I'll keep the /topic here up to date [08:33:50] okay :) sure you're working hard, thanks for your time! [08:33:51] np! [08:33:51] my feelings about NFS should probably be kept to myself [08:33:51] I will kill it some day. 
already most projects don't have it. [08:34:03] (03Abandoned) 10Yuvipanda: Add debian dir [labs/invisible-unicorn] - 10https://gerrit.wikimedia.org/r/242389 (owner: 10Yuvipanda) [08:34:04] andrewbogott: ^ so grrrit-wm and nagf (the tools running on k8s) still work [08:34:05] kubernetes victory! [08:34:06] is grrrit on kubernetes? [08:34:06] yes [08:34:07] tools.wmflabs.org/paws won't work since it explicitly depends on NFS [08:34:07] but neither grrrit nor nagf do [08:34:07] so [08:34:07] Man, NFS may suck, but /nothing/ sucks enough to explain how often our hardware dies. [08:34:08] It’s a single point of failure, yes, but the average lifespan for a server should be > 3 months [08:34:26] bd808: when it all comes back up and we move stashbot to tools, we should move it to raw kubernetes. [08:34:27] bd808: there's an NFS outage but kubernetes tools that don't explicitly opt in to NFS are fine now [08:34:32] PROBLEM - Host tools-worker-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.122) [08:34:45] My bot on grid is not working [08:34:45] also it cannot login to SSH [08:34:45] what`s going on [08:34:45] same here [08:34:46] oh [08:34:46] stuck at ssh handshake [08:34:46] you also cannot operate bot? [08:34:46] known issue [08:34:47] NFS problem [08:34:47] being worked on [08:34:47] see topic [08:34:48] What is NFS? [08:34:49] https://en.wikipedia.org/wiki/Network_File_System [08:34:49] hi guys, I'm having issues copying files to korma, it refuses the rsync. Do you know if something changed in the conf of the machine? [08:34:50] hello. Labs NFS is having issues so your instances might not be reachable. we're close to a solution, please stand by [08:34:50] PROBLEM - Puppet failure on tools-mail-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [08:36:37] things are recovering [08:37:28] wm-bot is already working :o [08:38:29] !log tools reboot tools-bastion-01 [08:38:38] andrewbogott: your servers are failing within 3 months o.O [08:38:41] that should bring the bastion back [08:39:01] !ping [08:39:01] !pong [08:39:05] !ping is pong [08:39:06] This key already exists - remove it, if you want to change it [08:55:49] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 959930 bytes in 3.169 second response time [09:08:27] resolved-ish? [09:09:04] tool crons should be back now [09:09:13] Earwig: pretty much everything should be back now. [09:09:22] thanks for your work :) [09:12:36] Wheatley: it should be all back now [09:12:42] HakanIST: it should all be back now [09:12:49] yes thank you [09:19:31] RECOVERY - Puppet failure on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:22:51] Woot. [09:26:04] thx [09:36:19] YuviPanda: still awake? [09:36:39] yes [09:36:48] (fixing other issues) [09:36:58] ahh okay! 
*reads up* [09:37:15] ahh, all of the things broke again :/ [09:37:54] addshore: except grrrit-wm and tools.wmflabs.org/nagf since they were on kubernetes without NFS :) [09:38:01] :D [09:38:15] none of my things broke as none of them are running ;) [09:38:33] Huh [09:38:48] I was going to ask if you could take a look at https://phabricator.wikimedia.org/T121095 but I'll let you continue fixing things :D [09:39:32] addshore: I commented [09:39:34] * YuviPanda kicks wikibugs [09:39:37] :) [09:40:39] 6Labs, 10Attribution-Generator, 6TCB-Team, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1883943 (10Addshore) Needed for an actual domain. AFAIK this is the only way? [09:40:50] oh, wikibugs listens to me ;) [09:41:48] 6Labs, 10Attribution-Generator, 6TCB-Team, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1883949 (10yuvipanda) We highly prefer not doing that, but if we must, you should get clearance from legal to ver... [09:43:19] * valhallasw`cloud hugs YuviPanda [09:45:01] I kicked wikibugs [09:45:09] valhallasw`cloud: we're still dealing with some cron issues [09:45:18] :( [09:47:28] addshore: personally, I'd just say 'use a *.wmflabs.org unless you absolutely must' and 'if you expect this to be a productionish service, host it elsewhere :D' [09:48:10] YuviPanda: I tried to read back on logs, but the IRC logs apparently are stored on NFS ;-D [09:48:14] must be its own domain ;) probably no reason it couldnt be more productionay though ;) [09:48:43] addshore: alright, needs legal signoff in that case. [09:49:09] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1883954 (10Addshore) [09:49:10] we haven't really internally discussed that at all either, so it might still end up being a 'no' afterwards. I don't have a clear definitive answer now [09:49:15] I can bring that up in the next labs meeting [09:49:23] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868930 (10Addshore) Ping @Slaporte and @ZhouZ as with the last ticket. The only thing I can see t... [09:49:47] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1883958 (10Addshore) [09:49:56] coolio! [09:51:35] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1883963 (10yuvipanda) a:3yuvipanda [09:51:42] YuviPanda: we should really have some letsencrypt magic in the labs proxy so we can support arbritrary domains :-p [09:51:50] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868930 (10yuvipanda) I'll bring this up in the next labs meeting coming Monday to see what others... 
[09:52:02] valhallasw`cloud: yeah, but it's also a problem when different people control the domain and the IP [09:52:11] I'm sure you experienced it with the pywikibot domain [09:52:35] 'make it a cname to wmflabs.org' ? :P [09:56:47] ok, cron isn't going to fire for about 15 minutes now [09:57:26] YuviPanda: I'm also a bit worried about the weird de_DE.UTF-8 issue on tools-exec-1201. It's easy enough to kill that host, but it suggests something is wrong deeper down :/ [09:57:54] everything is wrong deep down [09:58:00] I didn't even get to look at that ticket :( [09:58:18] lol [09:58:23] I also really can't think of a reason why, of all options, de_DE.UTF-8 is the chosen locale :-D [09:58:24] I woke up, was fighting something, took a nap, woke up to fight something else, now finished fighting something and continuing to fight something else [09:58:27] * valhallasw`cloud eyes addshore [09:58:38] :O [09:58:39] * YuviPanda invades pl [09:58:48] what's wrong with cron again? [09:59:08] nothing atm. I haven't disabled it yet. [09:59:19] some of the PAM stuff is conflicting with cron causing permission denied errors [09:59:25] so we have puppet disabled on the cron host [10:00:07] it's been taken down now [10:00:54] YuviPanda: any reason not to postpone that to after sleep? [10:01:05] what sleep? [10:01:19] the sleep you desperately need ;-) [10:07:47] !ping [10:07:47] !pong [10:07:48] ok [10:29:04] !pong [10:29:04] !pang [10:29:07] !pang [10:29:08] !pung [10:29:14] !pung [10:29:15] !derp [10:29:15] it is broken [10:29:37] !derp [10:29:37] Ok enough. I don't want to do this all day. [11:00:50] ^ those messages are from 2 years ago I think [11:01:01] and they are still here [11:02:59] cron should've been back a while ago [11:14:14] valhallasw`cloud: can you kick morebots? [11:15:06] 6Labs, 6operations: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#1884200 (10yuvipanda) [11:23:30] 6Labs: Syslog messages for user violating the nslcd "validnames" constraint - https://phabricator.wikimedia.org/T121630#1884211 (10MoritzMuehlenhoff) 3NEW [11:23:40] 6Labs: Syslog messages for user violating the nslcd "validnames" constraint - https://phabricator.wikimedia.org/T121630#1884218 (10MoritzMuehlenhoff) p:5Triage>3Normal [11:24:05] YuviPanda: no [11:24:33] YuviPanda: I have no clue how to kick morebots in the old situation, and others did not care about a fabric solution, so, no, sorry. [11:29:50] 6Labs: Syslog messages for user violating the nslcd "validnames" constraint - https://phabricator.wikimedia.org/T121630#1884221 (10hashar) @80686 is Manuel Schneider. He is involved since the very beginning of the Wikipedia project and, being an IT guy, most definitely registered an account on labs very early on. [11:35:06] 6Labs: Syslog messages for user violating the nslcd "validnames" constraint - https://phabricator.wikimedia.org/T121630#1884229 (10valhallasw) It's an old svn account, without corresponding wikitech user, as far as I can see. (cn is 80686, but https://wikitech.wikimedia.org/wiki/User:80686 doesn't exist) [11:35:45] 6Labs: Syslog messages for user violating the nslcd "validnames" constraint - https://phabricator.wikimedia.org/T121630#1884238 (10hashar) Other regulars that comes in mind: `01tonythomas` and `20after4` . No clue what their uid are though.
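For reference, the "puppet disabled on the cron host" step mentioned around 09:59 is the standard puppet-agent dance; a minimal sketch follows. The hostname is an assumption for illustration, not necessarily the actual cron host:

    # Stop puppet from re-applying the conflicting PAM config while it is
    # being sorted out, and leave a note explaining why it is disabled.
    ssh tools-submit.tools.eqiad.wmflabs        # hostname is illustrative
    sudo puppet agent --disable 'PAM change conflicts with cron, being investigated'

    # ...apply the manual workaround, then hand control back to puppet...
    sudo puppet agent --enable
    sudo puppet agent --test    # one verbose run to confirm a clean apply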
[12:19:31] 6Labs: Syslog messages for user violating the nslcd "validnames" constraint - https://phabricator.wikimedia.org/T121630#1884296 (10valhallasw) 01tonythomas has shell name tonythomas01, 20after4 has shell name twentyafterfour. [13:38:22] Hey folks. I'm getting some serious instability issues in the ores project. Anything up? [13:38:47] I saw the NFS stuff go by last night, but we're not using NFS, so we usually aren't affected by outages. [13:39:50] zhuyifei1999_: whoa, so reliable storage we have! [13:51:29] halfak: instability meaning? [13:51:41] valhallasw`cloud, not sure. [13:51:51] We keep dropping connections between the boxes in the cluster [13:52:06] E.g. celery stops talking to our web nodes [13:52:13] Redis stops talking to everyone [13:52:17] But only for a minute [13:52:23] Then everything recovers and comes back online [13:52:46] If I just route traffic to our staging server (single box with web, redis and celery running together) we don't have an issue. [13:52:59] But then we only have 8 CPUs handling all of our traffic [13:53:02] Which is not enough [13:53:37] * halfak just woke up YuviPanda [13:53:58] YuviPanda is very much asleep, I think ;-) [14:00:06] valhallasw`cloud: oh yeah, forgot about that. let it be dead, actually [14:11:37] chasemp: hey [14:11:50] chasemp: halfak is trying to create a new instance and running into the nonspecific 'failed to create instance' error [14:12:00] chasemp: can you check with him and see what's up? possibly out of quota [14:12:04] Instances: 10/10 [14:12:08] ok are existing instances fine? [14:12:10] yeah [14:12:17] seemed like some trouble there [14:12:21] ok [14:12:43] chasemp: yeah but that's just redis filling up disk [14:12:48] halfak: gimme a test URL? [14:12:54] For? [14:13:04] halfak: for ores? [14:13:12] http://ores-new.wmflabs.org/scores/wikidatawiki/reverted/32423425/ [14:13:16] ^ prod cluster [14:13:26] You'll need to update the URL because cache [14:13:58] ok it seems to be coming back up [14:14:00] YuviPanda, my requests are getting through [14:14:17] yeah two workers up [14:14:22] we have a redis [14:14:32] let me get more up [14:14:55] halfak: so our plan is to setup ores-redis-03 with a bigger instance, provision the /srv first (by applying that role first and running puppet) and then adding the ores redis role [14:15:07] this will have persistence and a bigger disk [14:15:30] We need a good way to limit the on-disk size. [14:15:38] I just want to preserve what is in memory for when we need to restart. [14:15:39] yes I'm making a puppet change for that stand by [14:15:45] kk [14:15:51] aof was a terrible idea for it and I should've seen this coming heh [14:16:28] oh yeah. lol [14:16:40] Basically using an on-disk DB! [14:17:24] * halfak teaches himself how AOF works quickly [14:17:30] http://redis.io/topics/persistence [14:17:33] we gotta switch to rdb [14:17:45] which has better characteristics for this I think [14:17:52] but ultimately we just need a bigger disk [14:18:10] +1 for rdb [14:18:33] Is it because we can't make our on-disk persistence match the size of our max-memory? [14:18:40] It seems that rdb would be better at that [14:18:52] Since it can *forget* in new snapshots [14:18:58] Whereas AOF can't forget [14:19:00] I think it was trying to cleanup and couldn't because it ran out of disk space [14:19:01] since it is append only [14:19:04] AOF compacts [14:19:09] Gotcha [14:19:11] but didn't compact enough [14:19:17] I see.
[14:19:34] So we'd need more disk to have an AOF file big enough for our memory lru [14:21:01] for compaction ya [14:21:08] http://redis.io/topics/persistence [14:21:11] look for 'log rewriting' [14:21:57] halfak: hahahahaha the problem is also that we had *both* enabled due to an accident [14:22:08] wooopsie [14:22:09] so we ran out of space [14:22:17] fun fun fun [14:22:19] anyway [14:22:25] chasemp: did you manage to increase quota? [14:22:49] not yet I"m side tracked atm with a bit of sorting out things from teh insanity, but I will [14:22:52] go back to bed dude [14:23:11] halfak: is the precached down? [14:23:22] halfak: can you start it back up so I can verify it's all back and can go back to sleep? [14:23:59] chasemp: kk I updated quota [14:24:10] !log ores updated quota for instances to arbitrary number (40) [14:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [14:24:14] halfak: ^ [14:24:20] ok man thanks, but...go back to bed :) [14:24:47] chasemp: once precached is up and verifying this is fine :) [14:24:49] YuviPanda, should I direct the proxy back first? [14:24:53] halfak: yes [14:24:56] Or run precached against ores-new? [14:25:02] ores-new == prod [14:25:03] halfak: hmm [14:25:04] right now [14:25:07] halfak: the latter [14:25:10] k [14:25:12] run precached againsg ores-new [14:25:18] and we can look at logs a bit [14:25:25] Off it goes [14:25:33] * halfak loves his little command-line utilities [14:25:42] take a second to change one param and off it goes! [14:26:02] Seems to be working pretty well. [14:26:05] halfak: hmm, only one worker is working [14:26:20] others are all at [14:26:22] Dec 16 14:17:15 ores-worker-02 celery[19865]: 2015-12-16 14:17:15,472 INFO:ores.scorer.scorer -- Loading ScoringContext 'wikidatawiki' from config. [14:26:28] restarting on -02 again [14:26:52] YuviPanda, it's a big model [14:26:57] Might take a couple seconds to load [14:27:02] Shouldn't be that long though. [14:27:17] It's got to put ~28MB into memory [14:27:20] * halfak runs a test [14:27:35] ooooohhhhh [14:27:40] -01 has a different loglevel! [14:27:42] that explains it [14:27:48] lol [14:27:49] wat [14:28:01] yeah I am sure we set it for testing some day in hiera and forgot to turn it off [14:28:14] :) [14:28:15] I straced and can see it doing the work [14:28:17] ok [14:28:32] https://wikitech.wikimedia.org/wiki/Hiera:Ores/host/ores-worker-01 [14:28:34] bam [14:28:36] there [14:28:44] just deleted it [14:28:52] halfak: ok, redirect proxy now? [14:28:52] I just confirmed that we can load all of revscoring into memory, query the wikidata API and apply the RF model in 3 seconds. [14:28:57] kk [14:29:04] halfak: yep, was red herring due to that log thing [14:29:14] halfak: we have only 4 workers right? :D [14:29:23] right [14:29:26] ok [14:29:28] all are up [14:29:47] halfak: ok, so we point the proxy back and check and then I can go back to sleep. Is that ok? [14:30:00] Yes [14:30:04] <3 YuviPanda [14:30:15] halfak: I can setup the other redis when I'm back up awake for real since there's a tiny bit of trickery involved to make sure just rdb works right [14:30:17] Proxy is back [14:30:22] precached is running [14:30:39] YuviPanda, cool. [14:30:49] * halfak watches monitor for a minute [14:31:12] Looks like we're doing good. [14:31:13] halfak: so what I did was: 1. hand hack redis config file to turn off aof, 2. physically move the files out of /srv/redis, 3. restart redis [14:31:40] YuviPanda, gotcha. 
So that cleared our redis persistence out entirely? [14:32:00] yes [14:32:04] so all our old cache is dead [14:32:14] No worries :) [14:33:05] yeah [14:33:10] we can survive that :) [14:33:14] :) [14:33:24] OK. Still online and working at full steam [14:33:29] I think you're OK to sleep again [14:33:33] halfak: ok I'm off now. feel free to call me if needed again [14:33:47] Will do. thanks again. [14:34:06] np [14:34:17] halfak: can you also put together an incident report? [14:34:26] I'll add details when I wake up [14:34:36] Yup. Just scheduled time for it [14:34:48] also third outage of the day! woo [14:34:49] * halfak copy-pastes Yuvi's notes for the report [14:35:04] Oh god. Isn't it 6:30 AM there? [14:35:44] halfak: yup. I went to sleep at 4 since we had an NFS outage and also some PAM stuff [14:36:02] :( OK. More sleep plz [14:38:37] YuviPanda: go to beeed [14:38:48] valhallasw`cloud: am already in bed! [14:38:51] if I had +o, I would +b you from this channel :> [14:38:54] under a blanket actually. it's all cold [14:38:58] heh [14:39:03] ok ok I'm going now [14:40:44] 6Labs: Syslog messages for user violating the nslcd "validnames" constraint - https://phabricator.wikimedia.org/T121630#1884456 (10scfc) [14:40:46] 6Labs, 10Labs-Infrastructure: Remove shell user "80686" - https://phabricator.wikimedia.org/T63967#1884457 (10scfc) [14:45:03] andrewbogott, chasemp: ok to merge the idle_timelimit patch now? [14:45:22] moritzm: sure [14:45:49] k, going ahead [14:46:47] andrewbogott: man, good morning? [14:47:16] morning, at least! [14:47:45] So I think my patch last night + https://gerrit.wikimedia.org/r/#/c/259471/1 [14:47:52] should have access acting normally again [14:59:44] valhallasw`cloud: maybe you can, there is this command @kb [14:59:53] you don't really need to be op :P [15:15:43] 6Labs, 5Patch-For-Review: Thousands of duplicate /etc/pam.d/*.orig files which may be messing with our pam config - https://phabricator.wikimedia.org/T121533#1884500 (10Andrew) 5Open>3Resolved a:3Andrew I've updated the script to move .orig files into a diferent dir outside of pam's view (/etc/pambak), a... 
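The AOF-versus-RDB exchange above (roughly 14:15 to 14:31) ends with AOF being turned off by hand, the old files moved out of /srv/redis, and redis restarted. A rough equivalent using runtime config commands is sketched below; it assumes a stock redis on the same host, and the memory limit, snapshot thresholds and paths are illustrative rather than the actual ores puppet settings:

    # See how much room the persistence files are taking up.
    df -h /srv
    ls -lh /srv/redis/

    # Switch the running instance from append-only logging to RDB snapshots.
    redis-cli CONFIG SET appendonly no
    redis-cli CONFIG SET save "900 1 300 10 60 10000"   # snapshot thresholds: illustrative

    # Keep the in-memory dataset bounded so snapshots stay a predictable size.
    redis-cli CONFIG SET maxmemory 2gb                  # size: an assumption
    redis-cli CONFIG SET maxmemory-policy allkeys-lru

    # Persist the new settings (works when redis was started from a config file)
    # and take a snapshot right away.
    redis-cli CONFIG REWRITE
    redis-cli BGSAVE

    # Park the old append-only file out of the way, as was done by hand here.
    sudo mkdir -p /srv/redis-old-aof
    sudo mv /srv/redis/*.aof /srv/redis-old-aof/

Unlike AOF, an RDB snapshot only ever describes the current dataset, so evicted and expired keys are "forgotten" on the next snapshot instead of accumulating on disk until a rewrite.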
[16:24:18] !log tools rebooting tools-exec-1221 as it was in kernel lockup [16:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [16:56:41] https://wikitech.wikimedia.org/wiki/Incident_documentation/20151216-ores [16:56:49] YuviPanda, [16:56:58] valhallasw`cloud, ^ (if you're curious what happened) [16:57:03] Nothing to do with NFS [17:56:27] 6Labs, 6operations, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1884995 (10jcrespo) 3NEW [17:57:55] 6Labs, 6operations, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885006 (10jcrespo) [18:00:56] 6Labs, 6operations, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885029 (10Krenair) 5Open>3Invalid a:3Krenair not an operational issue [18:20:13] 6Labs, 6operations, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885079 (10Legoktm) Instead of deleting, please use templates like https://wikitech.wikimedia.org/wiki/Template:Archive [18:37:01] hi YuviPanda [18:38:36] do I need to do something to restart my tool? error.log just says 2015-12-15 23:54:02: (log.c.166) server started [18:38:49] I'm getting HTTP 500 errors when I go to it [18:39:52] 10PAWS: FileNotFoundError: [Errno 2] No such file or directory: 'generate_user_files.py' - https://phabricator.wikimedia.org/T120266#1885127 (10jayvdb) p:5Low>3High [18:41:01] 10PAWS: FileNotFoundError: [Errno 2] No such file or directory: 'generate_user_files.py' - https://phabricator.wikimedia.org/T120266#1849694 (10jayvdb) I've split the underlying problem (which isnt resolved) off as T121667. [18:54:16] fhocutt: try webservice restart? Could be left over from the NFS issues [18:54:45] 6Labs, 6operations, 10wikitech.wikimedia.org: Please delete https://wikitech.wikimedia.org/wiki/Schema_changes or give me permissions to do it - https://phabricator.wikimedia.org/T121664#1885191 (10jcrespo) No, I had already archived the old page, I needed to delete it because, for some reason, it didn't all... [19:00:56] fhocutt: which tool is this? [19:01:06] you've got quite a few :) [19:01:57] hey yalls, [19:02:05] i'm experimenting with using scap3 for deplying eventlogging in deployment-prep [19:02:39] hmmmm, nm, i think i can just use an existing group! [19:02:40] uh! [19:02:41] duh! [19:02:47] nm i had an ldap questino, but i think i can avoid it [19:02:47] nm! [19:03:21] irc duck debugging? ;-) [19:05:09] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1885305 (10Slaporte) Hi @Addshore, could you add a line to the Legal Notice page explaining that th... [19:05:10] hehe [19:05:11] yup [19:05:48] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1885307 (10Addshore) That shouldn't be an issue! 
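The 500-error exchange above (around 18:38 to 18:54) boils down to restarting the tool's webservice after the NFS hiccup; a minimal sketch from a Tool Labs bastion follows, with "mytool" as a placeholder rather than the actual tool name:

    become mytool            # switch to the tool account
    webservice restart       # stop and re-schedule the tool's web server job
    qstat                    # the webservice job should reappear within a minute

    # If it still returns 500s, the error log (which lives on NFS, so stale
    # handles from the outage can show up here too) usually says why.
    tail -f ~/error.log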
[19:59:02] 6Labs, 10Labs-Infrastructure: Remove shell user "80686" - https://phabricator.wikimedia.org/T63967#1885498 (10hashar) [20:36:07] !log hhvm deleting instance hhvm-build1 on Giuseppe and Moritz’s advice [20:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Hhvm/SAL, dummy [20:56:47] valhallasw`cloud: should we restart all webservices? [20:56:58] YuviPanda: I don't know? [20:57:07] hmm [20:57:32] seems to be mostly up [20:58:21] !log puppet-ca-cert deleting all instances, presumed abandoned by Gage [20:58:21] puppet-ca-cert is not a valid project. [20:58:23] ok, I guess we can let it be [20:58:55] !log puppet-ca-replacement deleting all instances, presumed abandoned by Gage [20:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-ca-replacement/SAL, dummy [21:11:42] andrewbogott: Failed to reboot instance puppet-test02.maps-team.eqiad.wmflabs [21:11:51] I seem to be admin [21:11:59] not sure what is going on... [21:12:09] akosiaris: Sometimes that failed to reboot is bogus, if you reload is it in rebooting state? [21:12:34] it's in active [21:12:55] ah, it seems to have rebooted indeed [21:13:05] ok... heisenbug. thanks andrewbogott [21:27:02] YuviPanda: tools-docker-registry-01.tools.eqiad.wmflabs seems to not like my root key. Yours? [21:27:32] andrewbogott: yeah, just kill it. [21:27:38] ok :) [21:28:19] !log tools deleted tools-docker-registry-01 [21:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [22:04:21] YuviPanda: is there a hiera key already for the novaproxies? [22:04:34] also, andrewbogott YuviPanda sorry I've been quiet I'm deep in scary salt territory [22:04:45] I became worried our one-off cleanups were more inconsistent and bad than we thought [22:04:46] and yeah [22:05:12] I'm trying to distill the thoughts but essentially talked to apergos and our salt master has none of even the rough fixes [22:05:13] I guess [22:05:19] so it's just known bad state [22:07:15] but there are a few considerations: salt minions that have no responding client and salt seems not to know exist, salt minions we think exist but cannot talk to, and salt minions that exist [22:15:24] yeah [22:15:42] Mostly I think we should abandon salt as soon as possible. I don’t have a lot of ideas for coping in the meantime [22:15:55] except, anytime I do something with salt I give it 6 or so tries [22:25:10] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Jarbot was created, changed by Jarbot link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Jarbot edit summary: Created page with "{{Tools Access Request |Justification=I would to create articles via the bot for the Arabic Wikipedia and need for this Tools. Thank you. |Completed=false |User Name=Jarbo..." [22:29:49] chasemp: what do you mean by 'hiera key for novaproxy' [22:30:43] YuviPanda: in prod we have a key like [22:30:43] $cache_misc_nodes = hiera('cache::misc::nodes') [22:30:57] that you can use to get a list of the reverse proxies [22:31:13] chasemp: https://wikitech.wikimedia.org/wiki/Hiera:Project-proxy so it's only available inside that project [22:31:15] why? [22:31:34] this is how mod_remoteip is set up dynamically for prod [22:31:52] it's also possible we don't want this in labs at all... [22:32:43] I don't think we care about mod_remoteip? [22:32:46] do we?
[22:33:25] 6Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1886168 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I guess? [22:33:33] I think I'm just going to if $realm == production this [22:34:03] chasemp: heh, usually instead I just have 'enable_feature' as a param and set that to default to true and set it to false in labs hiera [22:34:08] rather than $realm == production [22:34:15] much clearer that way [22:34:36] aaarggh gridengine [22:34:39] I don't think we ever really want phab in labs logging real ip's [22:34:43] or making that easy etc [22:35:06] chasemp: sure. just I hate realm == production branches since the intent is often unclear to someone cleaning up years later and I've been that person too many times [22:35:35] yeah, but I don't want to have a default true enable on a feature I don't want to land in labs I guess [22:35:40] oh sure [22:35:40] only because in this case it's user ip's etc [22:35:42] then default to false [22:35:46] and set to true in production [22:35:53] can you point me to an example? [22:35:56] sure [22:36:17] chasemp: well, just look at hieradata/labs.yaml [22:36:25] sure [22:36:26] archiva::proxy::ssl_enabled: false [22:36:28] is one example [22:36:32] base::remote_syslog::enable: false [22:36:34] is another [22:36:36] standard::has_admin: false [22:36:38] too [22:36:41] and standard::has_ganglia: false [22:36:54] and varnish::dynamic_directors: false [22:36:55] etc [22:36:57] anything with false [22:37:44] I'll sort it out [22:37:46] t [22:37:49] tx [22:37:54] chasemp: k [22:45:19] chasemp: andrewbogott I'm going to go find food and try to make my way to the heated office. anything you want me to do before that? I might be gone for a couple of hours. [22:45:21] ugh [22:45:25] I need to fix gridengine before that [22:46:07] YuviPanda: um… I’m good, except, sorry about grid engine :( [22:47:56] andrewbogott: :D [22:48:04] !log tools run qmod -c '*' to clear error state on gridengine [22:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:50:59] andrewbogott: ok, I've done the gridengine stuff. feel free to call me if people report more bad thing [22:51:01] *things [22:51:31] YuviPanda: can you text me your number? shows up as unavailable when you call [22:52:06] chasemp: it's in officewiki, but I'll send you a text. [22:52:41] thansk this way if I'm afk and get some page and all the things are bad [22:52:46] we can at least coordinate [22:53:04] chasemp: sent you a message [22:53:07] chasemp: ack? [22:53:13] yup thanks [22:57:55] chasemp: andrewbogott ok, am off for real now! cya in a few hours if y'a'll are here [22:58:06] later [22:58:29] I have two tasks started before the NFS failure [22:58:32] and they stuck there [22:58:41] qdel them doesnt make them go away [22:58:56] task id 175779 and 372844 [22:59:29] their values in state column of qstat are already dr [23:01:00] YuviPanda: …… [23:01:02] YuviPanda: ^ [23:02:53] liangent: I'm not sure how to remedy, andrewbogott any guesses? [23:03:47] I don’t know! I would’ve thought that qdel would do it. [23:03:50] liangent: what host is this on? [23:03:50] I will poke around a bit [23:04:58] I bet valhallasw`cloud would have a quick answer if he is around [23:05:18] liangent: what tool? 
[23:05:27] andrewbogott: liangent-php [23:05:52] 6Labs: ssh as system users not allowed in labs - https://phabricator.wikimedia.org/T121721#1886283 (10Ottomata) 3NEW [23:06:23] chasemp: 175779 on tools-exec-1407 [23:06:37] 372844 the same [23:07:12] and I see one more task from another user stuck on that host too [23:07:25] andrewbogott: ssh hangs there for me...general resource issue? [23:09:19] I don’t know. I can’t ssh either [23:09:23] there’s nothing upset in the syslog [23:10:59] yeah so I can't get in at all [23:11:05] not sure what to do there [23:11:20] I’m texting yuvi [23:11:34] not sure we should reboot it yet, since I can’t tell what harm this is causing [23:11:43] right could cause more havok [23:11:49] we need less havok [23:14:01] chasemp: yuvi says ‘reboot' [23:14:39] it's on labvirt1003 [23:14:44] do you use nova to restart these hard? [23:14:49] !log tools rebooting tools-exec-1407, unresponsive [23:14:54] I’m just rebooting via wikitech [23:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [23:15:18] I don’t see any issues on labvirt1003, just looked — plenty of ram, plenty of disk [23:16:01] liangent: was this a long-running job, something there since yesterday? [23:17:01] andrewbogott: it was started at 2015-12-09 02:36:37 [23:17:14] not a continuous job [23:17:15] liangent: wow, ok [23:17:21] but I did expect it to run for days [23:17:40] it consumes wikidata dump file [23:21:56] andrewbogott: okay they're gone now [23:22:40] and tools-exec-1407 is back [23:22:43] so I think we’re good [23:23:01] sorry if your job was killed right before it finished. NFS, as always :( [23:40:01] NFS ....blehhhhhhh
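The stuck jobs above (175779 and 372844, stuck in state "dr" on tools-exec-1407) and the earlier qmod -c '*' log entry are the usual gridengine cleanup sequence; a sketch, using the job ids from this conversation:

    # "dr" means deletion was requested while the job was running but the execd
    # on the node never acknowledged it -- usually because the host itself is hung.
    qstat -u '*' | grep ' dr '
    qstat -j 175779

    # Clear queue error states left over from the NFS outage
    # (the command from the !log entry earlier in the day).
    sudo qmod -c '*'

    # Ask for deletion; if the execd is unreachable, force-deregister the jobs
    # from the master's bookkeeping.
    qdel 175779 372844
    sudo qdel -f 175779 372844

    # When the execution host itself is unresponsive, as tools-exec-1407 was,
    # rebooting the instance is what finally clears the jobs.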