[00:02:07] andrewbogott: sorry wandered away, just a developer from en.WP who was one of the bot herders wringing their hands.
[00:03:01] Hasteur: no worries, I'm a bit caught up now.
[00:03:30] It's funny when I get asked to code review operations/puppet changes for WP when they really wanted Hashar.
[00:44:33] 28m compile, not amazing but better :)
[03:08:59] robla: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_tin : git fetch; git log HEAD..origin/wmf/1.23wmf10
[03:09:49] that shows changes +2d in gerrit that aren't on tin, and the only one should be your own change you're about to deploy
[03:10:13] "If there are other changes besides yours, go yell at the culprit" :)
[03:41:59] * bd808 waves to logstash-labs-wm
[04:27:48] !log logstash Testing irc input to logstash
[04:29:00] do newly created tools have some delay before their passwords can be accepted by mysql servers?
[04:30:18] liangent: I believe yes, but it should be probably less than a minute.
[04:31:30] scfc_de: ok one minute passed and it's still not usable
[04:33:15] Coren: ^
[04:33:45] liangent: What's the tool's name?
[04:34:44] !log logstash Another irc test with filter to only retain !log messages
[04:34:48] scfc_de: liangent-django
[04:37:32] liangent: Could you please file a bug in case Coren's not able to fix it immediately?
[04:37:46] scfc_de: The delay could be up to 5 minutes.
[04:38:15] (But that's really worst-case)
[04:38:47] Coren: ok it works now :)
[04:39:08] `ssh -v -v bastion.wmflabs.org` isn't responding, "debug2: we sent a publickey packet, wait for reply" then nothing
[04:42:02] Coren: k
[04:42:36] spage: Possibly the bastion suffered a bit from the outage. I go see.
[04:42:44] In the meantime, you can use bastion1 or bastion2
[04:43:04] liangent: You created the tool yesterday, 19:30Z?
[04:43:16] Coren: thanks. I have an ssh session proxied through bastion.wmflabs.org which is staying up, but can't create a new one
[04:44:06] I get "If you're having access problems" from bastion.wmflabs.org, so it is responding in some way :-).
[04:44:23] Yeah, but it does seem ill. I can't log in with a root key either.
[04:44:37] And bastion2 doesn't seem to fare much better actually.
[04:44:44] Eff, I was heading to bed.
[04:45:19] bastion3 seems okay.
[04:46:27] Now bastion and bastion2 fail directly.
[04:46:46] Coren: Short question: What was the issue with liangent's tool?
[04:47:21] Coren: sorry. andrewbogott rebooted bastion.wmflabs.org last night (or two nights ago), he thought it might have been OOM
[04:47:47] Coren: Hiccup due to the outage?
[04:47:55] scfc_de: Yes.
[04:48:35] We'll probably suffer leftover kinks for a day or two. Losing the internal network and all its backups like that isn't healthy.
[04:48:51] spage: Yeah, I think I'll have to do the same.
[04:52:53] scfc_de: no I created it tens of minutes ago
[04:54:00] and any possible reason that a tool migrated from toolserver doesn't work with an error message "Proxy Error // Reason: Error reading from remote server" after a long wait (time out)?
[04:54:38] spage: Rebooted 1 and 2; 3 is already okay. I go to bed now.
[04:54:40] * Coren waves.
[04:54:55] thanks, le weekend beckons
[04:55:11] liangent: In /data/project/liangent-django, cgi-bin was modified yesterday, 19:14Z, and replica.my.cnf 19:30Z.
[04:55:19] Coren: Good night.
[04:56:00] liangent: Re error message: Doesn't sound familiar. Are you using "webservice start"?
[04:57:12] scfc_de: really .. maybe I forgot ;P I started migration just now
[04:57:40] :-)
[04:57:41] maybe ... I created it yesterday and decided to start migration but the fiber cut blocked that
[04:58:01] Probably.
[04:58:21] (And it blocked the creation of the DB user on the replicas :-).)
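
The delay discussed above (a new tool's database credentials being accepted within about a minute, worst case five) can be absorbed on the client side by retrying. A minimal sketch, assuming a hypothetical zero-argument `connect` callable that raises while the account is not yet usable. The function name, the 300-second default (taken from the five-minute worst case quoted above), and the 10-second interval are illustrative choices, not something Labs provides:

```python
import time


def wait_until_accepted(connect, timeout=300.0, interval=10.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or `timeout` seconds have elapsed.

    `connect` is a hypothetical callable supplied by the caller; it should
    raise an exception while the credentials are still being rejected.
    Returns a (result, attempts) tuple on success and re-raises the last
    error once another wait would overrun the deadline.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect(), attempts
        except Exception:
            # Broad catch for brevity; real code should catch only the
            # database driver's "access denied" error.
            if clock() + interval > deadline:
                raise
            sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. If the credentials still fail after the deadline, that is the point at which filing a bug, as suggested in the log, makes sense.
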
[04:59:02] scfc_de: yes I'm using the new web because I need fcgi
[04:59:58] liangent: Okay, then that sounds as if the webproxy (tools.wmflabs.org) tries to fetch the page from your webserver, but something times out.
[05:02:09] liangent: There's an error message in /data/project/liangent-django/error.log that seems to be from before your tool's DB access. Perhaps restart the webserver? ("webserver restart" IIRC.)
[05:13:25] scfc_de: is there some way to ask the webserver to flush error.log other than restarting it?
[05:13:28] btw it's webservice restart
[05:15:37] liangent: "Flush"? My thinking was more that your FCGI was in an error state and so to restart it.
[05:17:13] (I don't know how long FCGI holds scripts in memory so "webservice restart" is my sledgehammer.)
[05:17:44] scfc_de: write(2) then fsync(2). lighttpd doesn't write error.log after every request. at least not write to disk
[05:19:44] liangent: Ah, okay. No, I don't know how to force lighttpd to flush the log.
[05:23:55] Good night!
[05:55:07] "User p50380g51016 already has more than 'max_user_connections' active connections"
[05:55:20] I haven't seen this on Toolserver with this tool
[05:55:27] is Labs using a lower limit?
[12:32:01] ssh: Wrong passphrase
[12:32:01] o_O something changed
[12:32:02] ?
[14:19:02] "ssh: Wrong passphrase" is a message local to your computer.
[17:00:23] is project:puppet-cleanup the correct project if i wanted to spin up a temporary instance to try some changes to operations/puppet.git ?
[17:01:15] jeremyb: if ^^ can you add User:JanZerebecki to it?
[17:06:15] i have no idea
[17:06:28] and i have to run in a min
[17:33:22] jeremyb: oh. anyway it's probably right. can you add User:JanZerebecki as project admin to project:puppet-cleanup when you get back?
[18:31:31] Coren: The SGE master seems to be down; qstat fails on tools-login and tools-master itself.
[18:32:55] Huh. Interesting. I'm pretty sure that's the first time I saw that.
[18:32:57] Restarted.
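
The "write(2) then fsync(2)" remark at 05:17:44 is about the gap between a program buffering log output and that output actually reaching the disk. A minimal sketch of the two steps in Python, assuming a hypothetical helper `append_and_sync`; it illustrates the system calls only and says nothing about how lighttpd itself is configured or can be made to flush:

```python
import os
import tempfile


def append_and_sync(path, line):
    """Append one line to `path` and force it to stable storage."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")   # lands in Python's userspace buffer
        f.flush()              # write(2): hand the bytes to the kernel
        os.fsync(f.fileno())   # fsync(2): ask the kernel to commit them to disk


# Demonstration against a throwaway file, not a real server log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "error.log")
    append_and_sync(log_path, "example entry")
    with open(log_path, encoding="utf-8") as f:
        print(f.read(), end="")
```

Without the `flush()` and `os.fsync()` calls, another process reading the file may not see the entry until the buffer fills or the writer exits, which matches the behaviour liangent describes.
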
[18:34:15] Still not responding?
[18:34:36] Huh. Went back down
[18:34:48] It did work for me long enough to go away.
[18:35:36] * Coren checks the log to see why it went down.
[18:36:19] hmmm.
[18:36:48] SGE down since at least 16:15 UTC.
[18:37:14] or maybe 17:15 UTC.
[18:38:41] It goes up when I start it, and works perfectly for a little while, then goes away quietly.
[18:41:22] OOM? Free is ~200 MByte, then that goes down, and it's back at 200. BTW, what does salt-minion need 600 MByte of VMEM for?
[18:43:29] No, that's related to the outage; I see the error and I'm not the first to see it (generally as a consequence of things like power fails or hardware breaking)
[18:45:44] Oh, poop.
[18:47:30] Still, a machine with 2 GByte of memory and only 200 MByte free without any users or services looks very odd.
[18:48:05] scfc_de: You're reading the wrong line, dude. :-)
[18:48:12] There's 1.9G free on that box.
[18:48:45] "Mem: 2051744k total, 1827256k used, 224488k free, 11412k buffers"?
[18:49:05] Okay, "free" adds the buffers.
[18:50:53] You're also lacking cached in there. Use 'free' when you want real numbers.
[18:51:14] Okay, okay, okay :-). So: OOM is not the issue.
[18:51:39] There was some corruption in the job queue, apparently originating on -09
[18:52:24] And it's running now again?!
[18:53:17] Because I've restarted -09. It probably wasn't the root cause (bad entry in the job db) but having it forcibly get rid of the jobs will have solved it.
[18:55:08] Where's the log file for the master?
[19:01:19] Ah, I see the root cause. the execd on -02 got really messed up at some point, and had leftover partial jobs in its spool that, when reported upon, confused the heck out of the master
[19:01:48] -09
[19:02:18] scfc_de: in /var/spool/gridengine/qmaster
[19:04:40] And now everything's unconfused? BTW, "denied: host "tools-webserver-03.pmtpa.wmflabs" is no admin host": someone trying to change something from the webserver?
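
The memory exchange above (18:47:30 to 18:51:14) comes down to arithmetic: the "free" figure on the quoted "Mem:" line excludes buffers and page cache, both of which the kernel can reclaim. A rough sketch of that calculation, assuming hypothetical helper names. The "Mem:" line is the one quoted in the log, while the `cached_kib` value is made up for illustration because the log never states it:

```python
import re


def parse_top_mem_line(line):
    """Parse a top(1)-style 'Mem:' line into a dict of KiB values."""
    return {name: int(value) for value, name in re.findall(r"(\d+)k (\w+)", line)}


def effectively_free_kib(mem, cached_kib):
    """Free memory plus what can be reclaimed from buffers and page cache."""
    return mem["free"] + mem["buffers"] + cached_kib


mem = parse_top_mem_line(
    "Mem: 2051744k total, 1827256k used, 224488k free, 11412k buffers"
)
# cached_kib is a hypothetical figure; top prints the real one on a
# separate line that was not pasted into the channel.
print(effectively_free_kib(mem, cached_kib=1_000_000))
```

As noted in the log, `free` reports the buffers/cache-adjusted numbers directly, so the sketch is only meant to show why the raw "free" column understates what is available.
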
[19:05:06] Yeah, I was wondering that same thing myself.
[19:05:23] Yeah, everything seems unconfused since I've cleaned up the -09 spool
[19:06:01] Okay, thanks. Re webserver-03, the calls seem to be exactly every minute?
[19:06:25] 30s
[19:06:40] No crontabs at all on webserver-03.
[19:06:48] As should be.
[19:07:13] But it's Saturday here. A little noise in the logs isn't going to stress me overmuch. :-)
[19:08:28] Enjoy your weekend :-).
[19:44:41] Coren, replication back up?
[20:03:51] seems to be
[21:06:29] What is the templateeditor right?
[21:08:22] Kolega2357, it's a new right created on the English Wikipedia, it allows users with it to edit template-protected pages
[21:08:40] It's a new protection level between semi and full protection
[21:09:31] that only applies to templates
[21:09:39] And modules