[01:17:39] flinfo isn't on the toolserver or wmflabs right?
[01:38:48] What's flinfo?
[01:47:32] http://wikipedia.ramselehof.de/flinfo.php?
[01:52:17] comets: Nope, neither. I'm pretty sure there are comparable tools though.
[01:53:49] i figured it went kaput because of the toolserver migration in a few days....nvm then :)
[04:22:42] !ping
[04:22:42] !pong
[06:45:00] !log graphite repointed graphite.wmflabs.org to diamond-collector.wmflabs.org, since graphite instance doesn't seem puppetized nor used.
[06:45:03] Logged the message, Master
[06:45:13] Damianz: ^
[12:21:18] building of instance continuously returns various errors such as file not found, unable to create socket, failed to init interface and a bunch of others
[12:23:47] puppet status - failed
[13:22:37] wtf is this
[13:23:24] valhalla1w: An IRC channel
[13:23:44] no, my SGE job acting weirdly
[13:23:59] but I guess this last time it was just a slow NFS
[13:36:42] !log local-heritage Burned the old ~erfgoed account on the Toolserver and uploading the backup to ~/toolserver_backup/
[13:36:44] Logged the message, Master
[16:51:49] tools.wmflabs.org is broken, "504 Gateway Time-out"
[16:52:20] shh and ping works
[16:52:44] shh your own self...
[16:52:49] @ping
[16:52:53] !ping
[16:52:53] !pong
[16:53:33] Sorry, nevermind me.. thought I was in -offtopic for a moment.
[17:08:52] !log local-locator Burned the old ~locator account on the Toolserver
[17:08:53] Logged the message, Master
[17:19:02] 3Wikimedia Labs / 3tools: Add some of the missing tables in commonswiki_f_p - 10https://bugzilla.wikimedia.org/59683#c11 (10merl) Also on s5 (others not tested) >SELECT * from commonswiki_f_p.page limit 3; ERROR 1296 (HY000): Got error 10000 'Error on remote system: 1054: Unknown column 'page_content_model'...
[17:39:58] no admin to take a look at why nginx returns only 504?
[17:47:04] 3Wikimedia Labs / 3tools: tools-webgrid-02 completely overloaded / unresponsive - 10https://bugzilla.wikimedia.org/67279 (10metatron) 3UNCO p:3Unprio s:3critic a:3Marc A. Pelletier Timeouts on tools-webgrid-02, even ssh'ing is impossible.
[18:18:32] 3Wikimedia Labs / 3tools: tools-webgrid-02 completely overloaded / unresponsive - 10https://bugzilla.wikimedia.org/67279 (10Liangent) 5UNCO>3NEW
[18:21:38] scfc_de`: ^
[18:21:48] scfc_de`: wonder if we should restart it
[18:22:17] as it is, things are dead
[18:24:12] !log tools rebooted tools-webgrid-02. Could not ssh, was dead
[18:24:13] Logged the message, Master
[18:28:11] GerardM-: doesn't seem to have fixed anything
[18:28:18] GerardM-: might need Coren
[18:28:44] * Coren looks into it
[18:28:49] graphite tells me it has been down for an hour now
[18:29:11] scfc_de`: graphite.wmflabs.org :)
[18:29:19] I'm getting "no route to host"; it's probably rebooting now.
[18:29:29] It is.
[18:29:56] And it's woken up now.
[18:30:18] Coren: now http://tools.wmflabs.org/?status is 502
[18:30:23] (was 504)
[18:31:12] Well /admin/ (what serves the root) was probably on -02 when it died; it'll take a moment for gridengine to respawn the stuff that was running there.
[18:32:20] I see the execd has restarted; the master should notice the dead jobs and restart them gradually about now
[18:32:49] so it's becoming 503 now :/
[18:36:50] liangent: There is nothing unusual about this; that's normal behaviour for the proxy as the backend successively times out, fails, is disabled and (after some time) will be restarted.
[18:39:14] A quick inspection of the logs shows that webgrid-02 just plain out ran out of resources. I'll have to add a new webgrid node given the increasing number of tools.
[18:43:04] yeah i was surprised we have only two of those
[18:45:07] how long will it take for resources to get back online?
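The 504 → 502 → 503 progression Coren explains at [18:36:50] is the proxy's normal reaction as a backend times out, fails, and is disabled pending restart. A minimal sketch of checking a URL and mapping its status to those states follows; the URL in the comment comes from the log, while the `probe` helper and its timeout are assumptions.

```python
# Sketch of the proxy-state progression described above (504 -> 502 -> 503).
# Status codes are from the log; the probe helper and timeout are assumptions.
import urllib.request
import urllib.error

def probe(url, timeout=10):
    """Return the HTTP status code for url, including 5xx error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def classify(status):
    """Map a status code to the proxy behaviour Coren describes."""
    if status == 504:
        return "backend timing out"
    if status == 502:
        return "backend failed"
    if status == 503:
        return "backend disabled, awaiting restart"
    return "ok" if 200 <= status < 300 else "unexpected"

# classify(probe("http://tools.wmflabs.org/?status"))
```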
[18:46:21] should've been up now
[18:46:25] Coren: ^
[18:47:53] 502 Bad Gateway
[18:48:00] while trying to view a map from wma.wmflabs.org
[18:48:42] 503 Service Temporarily Unavailable
[18:48:44] for me
[18:49:08] * YuviPanda has a cast on one hand, can't do much
[18:49:19] meh, whatever. let me give it a shot
[18:49:27] YuviPanda: graphite looks complex; we still have to define dashboards to have a Ganglia-like experience?
[18:49:39] scfc_de: indeed. I've a grafana patch in the works
[18:49:48] scfc_de: http://grafana.org/
[18:50:21] Coren: lots of connection refuseds on the proxy. looks like the reboot didn't clear the proxy entries.
[18:50:38] Coren: I'm going to clear all the proxy entries and restart proxylistener, can you reboot all web instances?
[18:50:51] Shouldn't the restarts of the jobs reset the proxy entries?
[18:50:58] me too 503's
[18:51:00] scfc_de: should have
[18:51:09] but apparently not
[18:51:36] scfc_de: do you know how to restart all webgrid jobs?
[18:52:14] scfc_de: hmm, actually no. jobs there seem to keep restarting
[18:52:41] scfc_de: on -02 that is
[18:52:43] YuviPanda: Nothing more precise than "qstat | filter | qmod -rj"?
[18:53:45] Coren: still around?
[18:54:24] * andrewbogott reads backscroll...
[18:54:42] Is the grid back up and running? I just got 266 emails about failed jobs :(
[18:54:53] I think the web for tools.admin is different which causes http://tools.wmflabs.org/ to still be 503. Are the other web apps okay?
[18:55:26] andrewbogott: grid seems to be having issues. I see constant restarting from looking at proxylistener
[18:55:32] Ah, still?
[18:55:50] * twkozlowski can only help in saying he is now getting 503s
[18:55:53] I can't tell from my bouncer if your conversation with Coren about the webgrid nodes was a few minutes ago or many hours ago
[18:55:53] andrewbogott: yeah. at least two jobs seem to be on a restart loop
[18:55:59] andrewbogott: it was a few minutes ago
[18:56:25] I can also use only one of my hands (physio, one hand is immobilized) so can't do too much :(
[18:56:33] YuviPanda: link me to labs graphite?
[18:56:41] andrewbogott: graphite.wmflabs.org
[18:56:55] hm, that's simple enough
[18:57:16] YuviPanda: Which ones?
[18:57:25] Hm, no aggregate graphs for tools yet, huh?
[18:57:27] (Jobs restarting.)
[18:57:27] scfc_de: wikiviewstats and wikidata-todo
[18:57:36] http://graphite.wmflabs.org/render/?width=1133&height=630&_salt=1404068217.589&target=tools.tools-webgrid-02.network.eth0.tx_byte.value&target=tools.tools-webgrid-02.cpu.total.system.value&target=tools.tools-webgrid-02.cpu.total.user.value&target=tools.tools-webgrid-02.network.eth0.rx_byte.value
[18:57:43] is graph of rx/tx on -webgrid-02
[18:57:51] YuviPanda: wikiviewstats is hedonil's own stuff; that is not necessarily related.
[18:58:09] (I. e., the grid restart and his watchdog script may collide.)
[18:58:15] scfc_de: ah, hmm. right
[18:58:21] andrewbogott: yeah, am going to set up grafana soon
[18:58:37] YuviPanda: you got wrist surgery? Or skateboarding accident?
[18:58:54] andrewbogott: no, started physio for Carpal Tunnel.
[18:59:20] andrewbogott: way more painful than actual CTS. I'm doped up on some strong painkillers too
[18:59:31] ouch
[18:59:43] yeah
[18:59:47] So, it looks like maybe those jobs have settled down. At least I've stopped getting emails...
[18:59:47] Carpal tunnel syndrome (CTS) is a median entrapment neuropathy that causes paresthesia, pain, numbness, and other symptoms in the distribution of the median nerve due to its compression at the wrist in the carpal tunnel.
[18:59:51] YuviPanda: And the doctor said: "So shred your other hand to smithereens?" :-)
[18:59:53] sounds pretty ouch.
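The "qstat | filter | qmod -rj" one-liner scfc_de mentions at [18:52:43] amounts to parsing gridengine's plain-text job listing and force-restarting the matching job IDs. Below is a hedged sketch of the parsing half; the sample output, column layout, and the `lighttpd-` job-name prefix are assumptions modelled on gridengine's usual format, not taken from the log.

```python
# Sketch of the "qstat | filter | qmod -rj" idea: extract job IDs from
# qstat-style output. Sample data and column layout are assumptions.
def job_ids(qstat_output, name_prefix="lighttpd-"):
    """Return job IDs whose job name starts with name_prefix."""
    ids = []
    for line in qstat_output.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) >= 3 and fields[2].startswith(name_prefix):
            ids.append(fields[0])
    return ids

SAMPLE = """\
job-ID  prior   name               user             state
----------------------------------------------------------
101     0.50    lighttpd-guc       tools.guc        r
102     0.50    cronjob            tools.foo        r
103     0.50    lighttpd-admin     tools.admin      Rr
"""
# Each returned ID would then be passed to `qmod -rj <id>`.
```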
[18:59:59] scfc_de: hehe :)
[19:00:21] I've gotten reasonably fast at typing one handed
[19:00:27] * twkozlowski tries to imagine YuviPanda processing a one-hand keyboard course
[19:00:34] heh
[19:00:50] Ah, I see the issue
[19:01:43] famous last words
[19:01:52] Something during the boot sequence effed up the permissions on /var/run/lighttpd
[19:02:53] How annoyingly destructive; that'll have prevented most webservices from successfully restarting.
[19:03:31] I can probably do a quick script to find which failed to restart for that reason and force them back up though.
[19:04:22] almost none started, I'd suspect.
[19:04:29] I'm looking at the network graphs and they're pretty flat
[19:04:35] (http://graphite.wmflabs.org/render/?width=1133&height=630&_salt=1404068217.589&target=tools.tools-webgrid-02.network.eth0.tx_byte.value&target=tools.tools-webgrid-02.cpu.total.system.value&target=tools.tools-webgrid-02.cpu.total.user.value&target=tools.tools-webgrid-02.network.eth0.rx_byte.value)
[19:04:41] both in and out
[19:04:47] ah, seem to be coming back up now
[19:05:03] Yeah; I'm going to do a quick check for those that failed to restart cleanly and give 'em a shove.
[19:06:30] scfc_de: https://gerrit.wikimedia.org/r/#/c/133274/ is the grafana patch, still needs some love if you've the time :) specifically the CORS fix that bd808 mentioned
[19:06:40] and in general porting from nginx to apache2
[19:06:54] scfc_de: graphite-test is running it now, exposed at charcoal-test.wmflabs.org
[19:08:32] Reasonator et al are a step further.. it reports No webservice
[19:08:33] The URI you have requested, /toolscript/index.html?pastebin=guXqaGQE, is not currently serviced.
[19:08:45] GerardM-: right, that's just a 404
[19:14:14] GerardM-: tools.wmflabs.org is back again. check resonator?
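The graphite render URLs pasted above are easier to build programmatically than to edit by hand. A small sketch, using only the standard library; the host and metric names are taken from the log, the function itself is an assumption about how one might wrap the render API's `width`/`height`/`target` query parameters.

```python
# Build a graphite /render/ URL like the ones pasted in the log above.
# Host and metric names are from the log; the wrapper is an assumption.
from urllib.parse import urlencode

def render_url(host, metrics, width=1133, height=630):
    """Return a graphite render URL for the given list of metric targets."""
    params = [("width", width), ("height", height)]
    params += [("target", m) for m in metrics]
    return "http://%s/render/?%s" % (host, urlencode(params))

url = render_url("graphite.wmflabs.org",
                 ["tools.tools-webgrid-02.network.eth0.tx_byte.value",
                  "tools.tools-webgrid-02.network.eth0.rx_byte.value"])
```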
[19:14:32] network traffic still at about 1/2 of average though
[19:15:19] Yeah, I've just forcibly restarted all webservices which failed to restart because of the permission problem, but it will take several minutes for gridengine to schedule them all as it won't flood the node.
[19:15:29] right
[19:15:49] Coren: we should add a couple of new nodes tomorrow or something. Can I watch as you do it?
[19:16:03] I'm also seeing a couple of "Webservice already running"; presumably safety net cron jobs
[19:16:06] what does it mean ... add more nodes
[19:16:28] YuviPanda: Sure; it's not all that complicated. I'm probably going to do this first thing tomorrow.
[19:16:52] GerardM-: means the web services will have more resources to run on, less likely that this will happen
[19:17:08] GerardM-: There are two nodes from which web servers are run at the moment, I'll add at least a third so that we don't run out of resources again
[19:17:09] Coren: cool. I should be around, but if not, can you wait for a bit for me? I'd like to add the second one :)
[19:17:12] ... is a node an entry point, a set of systems ?
[19:17:16] (assuming we add a second one)
[19:17:27] GerardM-: It's an actual (virtual) server
[19:17:36] YuviPanda: Sure, I'll wait for you.
[19:17:41] Coren: ty
[19:18:17] given that servers are virtual, what is the cost of having one ... if it is not used it idles no ?
[19:18:21] Every webservice that had failed when the box restarted should now be queued or started
[19:19:07] GerardM-: indeed, which is why I recommended that we add two more
[19:19:16] GerardM-: To a point -- there is overhead in having a running server that is all idle (ram, background daemons, etc). It's not too high though, which is why I'm not hesitating to add one now.
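Coren's "quick check for those that failed to restart cleanly and give 'em a shove" could look roughly like the sketch below: probe each tool's registered web port and queue a restart for the dead ones. The `webservice start` command appears later in the log; everything else here (the tool-to-port mapping, the probe, the sudo invocation) is an assumption, not Coren's actual script.

```python
# A guess at the "find which failed to restart ... force them back up" script
# Coren describes. Only `webservice start` is attested in the log; the port
# mapping and restart invocation are assumptions.
import socket
import subprocess

def is_up(host, port, timeout=2):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def shove(tools):
    """tools: {tool_name: (host, port)}; restart every tool that is down."""
    for name, addr in tools.items():
        if not is_up(*addr):
            subprocess.run(["sudo", "-u", "tools." + name,
                            "webservice", "start"])
```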
[19:19:47] Coren: andrewbogott scfc_de once my grafana work is done, I'm thinking of setting up http://cabotapp.com/ to do alerts
[19:19:48] having all services crash is way more expensive surely
[19:20:10] icinga, at least the way it is done in prod, is unviable for labs, so we should look at other alternatives. this lets us leverage diamond/graphite to do alerts too
[19:20:27] one negative is that alerts are in a db and not version controlled, which bothers me
[19:21:03] also hilarious in that it has hipchat integration but no irc
[19:21:19] YuviPanda: is that used elsewhere at WMF? Might be good to have some consistency in monitoring tools...
[19:21:37] I'm already a bit overwhelmed by the tool diversity in that realm
[19:21:54] YuviPanda: Some monitoring is better than none :-). Though I'd prefer if Icinga was available so that there is a correlation between services set up by Puppet and monitored by X.
[19:22:15] andrewbogott: yeah, it isn't. I'll talk to akosiaris (can't get his name right) and others first as well.
[19:22:41] andrewbogott: right. but prod is also replacing ganglia with grafana+graphite+diamond (same as labs right now) so should unify soon
[19:22:44] (I am aware of the problems of user-defined data on the monitoring hosts; just my dreams ...)
[19:23:05] scfc_de: it's also that the prod icinga code is... not very labsable anyway
[19:23:19] YuviPanda: yeah, graphite was the right choice over ganglia.
[19:23:26] Although I still don't really know how to work it
[19:23:49] andrewbogott: if you see the grafana patch I've been working on (https://gerrit.wikimedia.org/r/#/c/133274/) it started out for prod and I've been adapting it to labs.
[19:23:55] andrewbogott: and diamond is also the same code for labs and prod.
[19:23:56] AFAICT, everything is back up.
[19:24:19] andrewbogott: yeah, me neither (re: graphite :D). it has a terrible UI, but I played around with grafana and it seems like once we set up some dashboards it'll be a much better alternative
[19:24:34] Coren: yeah, network traffic seems to be back to normalish levels
[19:25:04] andrewbogott: graphite is fairly powerful, though - once you get past the stupid UI
[19:25:17] YuviPanda: Web service still borked?
[19:25:27] multichill: should be back up now...
[19:26:05] YuviPanda: http://tools.wmflabs.org/reasonator/?&q=5590831 is still broken
[19:26:16] hmm, didn't restart properly, I'd think
[19:26:51] Coren: Can you give the webservice for the reasonator account a kick in the nuts?
[19:27:36] multichill: Queued.
[19:28:55] multichill it is not only reasonator
[19:29:00] also the query tool
[19:29:11] reasonator often relies on it
[19:29:52] scfc_de: andrewbogott another approach is to set up icinga, but have it just query graphite for monitoring. that would let us use same tools as prod, but in a completely different configuration. Unsure if that's good or not
[19:30:29] !log local-heritage Web service was down for all accounts. Back up and running. Api seems to have been down from 19:30 to 21:15 (Amsterdam time)
[19:30:30] Logged the message, Master
[19:31:28] YuviPanda: would it be completely different? That seems like a reasonable thing for us to do in prod as well...
[19:31:36] YuviPanda: My pref as stated: Some is better than none. If Graphite already has alerts on its own, adding Icinga seems to increase complexity for me, and I would do without it. I'm mostly concerned about a drift between services set up by Puppet and monitored by X.
[19:31:38] Although I'm not sure if we have alerts associated with perf metrics already...
[19:32:18] andrewbogott: IIRC we don't, but I'll check. NRPE sends things to icinga, and diamond sends things to graphite, and a lot of times they are the same type of things
[19:32:39] scfc_de: graphite by itself has no alerts. so we'll have to add another system anyway. only question is which one.
[19:33:02] 3Wikimedia Labs / 3tools: tools-webgrid-02 completely overloaded / unresponsive - 10https://bugzilla.wikimedia.org/67279#c1 (10metatron) Aha, seems to be back on track now. Please restart /magnustools as many of his tools rely on this one.
[19:33:31] GerardM-: What's the query tool name? I can see about kicking too if needed.
[19:34:10] andrewbogott: scfc_de another option is www.shinken-monitoring.org which prod wants to move to eventually
[19:34:14] YuviPanda: Graphite is like cacti right? Looks more like capacity and performance monitoring than incident monitoring
[19:34:23] multichill: indeed.
[19:34:44] If you're fully puppet, then Nagios Core is probably easiest to set up
[19:35:07] multichill: right, but it would use resource collection, which is apparently not workable in labs
[19:35:39] A kick in the nuts seems pretty harsh.
[19:35:41] reasonator et al do not get their data
[19:36:17] YuviPanda: Resource collection? I don't know that concept in Nagios.
[19:36:33] YuviPanda: You seem to have all the knowledge to make an educated decision :-).
[19:36:43] Host, services, contacts, host groups, service groups, contact groups, commands, etc
[19:37:03] multichill: it's a puppet thing. you set up what you want to monitor in that particular role itself and then on the icinga host you 'collect' all those 'resources' (which are checks) and then it runs.
[19:37:23] multichill: so it's a way of managing what needs to be checked by defining them with the services themselves, rather than in a centralized location
[19:37:24] Oh right, push model
[19:37:27] multichill: that's what production uses
[19:37:44] I use Nagios in a 100% pull model (mostly SNMP)
[19:38:41] multichill: right.
[19:39:00] scfc_de: :D right.
[19:39:15] YuviPanda: Depends a bit on what you want to monitor which is easier to set up. So for networking stuff it's all SNMP. For *nix and Windows it's easier with daemons
[19:39:35] multichill: right. networking stuff I usually don't have to worry about, so it's all the 'other things'
[19:39:36] And you can just mix it
[19:40:55] right. problem is that collecting resources in labs is apparently problematic.
[19:42:54] ok any clue what is up with Reasonator et al ?
[19:43:30] it does not get its data
[19:43:56] http://tools.wmflabs.org/reasonator/?&q=2
[19:44:23] /magnustools is also down
[19:44:26] Probably https://bugzilla.wikimedia.org/show_bug.cgi?id=67279#c1
[19:45:43] !log tools magnustools: "webservice start"
[19:45:45] Logged the message, Master
[19:46:07] THANKS
[19:46:47] 3Wikimedia Labs / 3tools: tools-webgrid-02 completely overloaded / unresponsive - 10https://bugzilla.wikimedia.org/67279#c2 (10Tim Landscheidt) 5NEW>3RESO/FIX (In reply to metatron from comment #1) > Aha, seems to be back on track now. Please restart /magnustools as many of > his tools rely on this one....
[20:58:34] Silke_WMDE: around?
[21:00:11] Danny_B: No. I'm about to go to sleep. Sorry.
[21:15:38] petan: ping
[21:15:38] Hi Steinsplitter, you just managed to say pointless nick: ping. Now please try again with some proper meaning of your request, something like nick: I need this and that. Or don't do that at all, it's very annoying. Thank you
[21:42:59] !log local-heritage I put the RCE mysql conversion [[User:Akoopal]] made in ~/rce-nl-data . Still need to import it in Mysql to be useful. Data is CC0
[21:43:01] Logged the message, Master
[21:48:00] Coren: Do you know anything about webservices being stopped for tools?
[21:48:15] I've run into two random tools today that had no webservice running anymore
[21:48:26] tools.guc which got its service stopped around 18:30 UTC
[21:48:48] and tools.intuition
[21:48:50] also around 18:30 UTC
[21:48:54] had to manually start them
[21:49:35] Krinkle: All webservices died
[21:49:36] https://gist.github.com/Krinkle/f9a76590cf6b99f3686e
[21:49:50] Did someone break something in maintenance?
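YuviPanda's suggestion at [19:29:52] — keep icinga (or any alerter) but have it just query graphite — hinges on pulling the latest datapoint from the render API's `format=json` output and comparing it to a threshold. A sketch of the parsing side, under stated assumptions: the metric name and sample numbers below are invented, only the response shape follows graphite's documented JSON format (`[{"target": ..., "datapoints": [[value, timestamp], ...]}]`).

```python
# Sketch of an alert check that reads graphite's format=json render output.
# The SAMPLE metric and values are invented; only the JSON shape is graphite's.
import json

def latest_value(render_json):
    """Return the most recent non-null datapoint from a graphite response."""
    series = json.loads(render_json)[0]["datapoints"]
    for value, _timestamp in reversed(series):
        if value is not None:
            return value
    return None

SAMPLE = json.dumps([{
    "target": "tools.tools-webgrid-02.loadavg.01",
    "datapoints": [[0.4, 1404068100], [7.9, 1404068160], [None, 1404068220]],
}])
# An alert would fire when latest_value(SAMPLE) exceeds some threshold.
```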
[21:50:09] No, we don't have any monitoring so we didn't notice that it ran out of resources and crashed
[21:50:21] Is it not tracked anywhere which were running?
[21:50:27] Nope
[21:50:31] I can't keep up all the time with this stuff...
[21:50:47] Poke your employer about being professional
[21:52:04] Yuvi is working on getting some things set up now I believe Krinkle
[21:55:12] Not all webservices, only one of two nodes.
[21:55:19] Still bad, mind.
[23:02:49] Who maintains static?
[23:04:39] ireas, ping
[23:05:44] ireas, static is down. Can you reactivate it?
[23:19:34] Cyberpower678: one moment, I’ll have a look
[23:20:10] ireas, any idea why the web service went down in the first place?
[23:20:32] Cyberpower678: no. I restarted it, seems to work now
[23:20:53] Cyberpower678: I have not logged in to static for quite a while, so there was no manual modification
[23:20:55] ireas, if it OOMed, I can give you hedonil's improved web service.
[23:21:50] Cyberpower678: what’s the difference?
[23:22:13] hedonil wrote it to essentially maintain itself.
[23:22:47] If it dies, it automatically restarts, and it kills hanging requests to avoid clogs.
[23:23:22] Cyberpower678: we could try it :)
[23:23:48] it would be extremely helpful to get an e-mail notification if the webservice dies …
[23:24:51] ireas, I'm not sure if it emails you, but it writes to access.log
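The self-maintaining webservice Cyberpower678 describes at the end (restarts itself if it dies, notes the event in a log) reduces to a small watchdog pass, typically run from cron. The sketch below is a guess at that shape only: the `check`/`restart` hooks and the log message are assumptions, and nothing here is hedonil's actual script.

```python
# A guess at the shape of a self-restarting webservice watchdog as described
# above. The hooks and message are assumptions, not hedonil's actual code.
def watchdog(check, restart, log):
    """One watchdog pass: if check() reports the service down, restart it.

    Returns True when a restart happened, False otherwise.
    """
    if not check():
        restart()
        log("webservice was down; restarted")
        return True
    return False

# A cron job might wire this up (hypothetical helpers) as:
#   watchdog(check=port_answers, restart=start_webservice,
#            log=lambda msg: open("access.log", "a").write(msg + "\n"))
```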