[00:46:46] PROBLEM - Puppet failure on tools-worker-04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:50:53] ^ fixed [01:01:40] RECOVERY - Puppet failure on tools-worker-04 is OK: OK: Less than 1.00% above the threshold [0.0] [01:33:32] chasemp: andrewbogott goddamit, tools-worker-07 is stuck now [01:33:35] it wasn't stuck earlier in the day [01:33:41] I'm losing like one node every day now... [01:34:08] Is this a second go-around for this one? [01:34:16] I’m wondering if all of these are latent effects from the NFS death [01:38:19] andrewbogott: no [01:38:35] 6Labs: Create new labs project for wikimetrics - https://phabricator.wikimedia.org/T122108#1896824 (10madhuvishy) 3NEW a:3yuvipanda [01:38:36] andrewbogott: I rebooted it two days ago when almost all nodes were locked up [01:38:50] post-nfs failure? [01:38:54] yes [01:38:59] dang [01:39:07] well, wtf? This never happened before [01:39:09] same for tools-worker-08 [01:39:19] this is concerning... [01:39:44] it is... [01:39:58] do we lose forensics after a reboot? [01:40:14] the previous time this happened [01:40:18] dmesg didn't have anything [01:40:32] ok — I was going to say, we could copy the disk image before rebooting [01:40:36] but not if there’s nothing to learn [01:40:38] maybe I can reboot tools-worker-08 and see what happens [01:40:45] there probably is. can you copy it anyway? [01:40:57] yes — 07 or 08? [01:41:27] 6Labs: Create new labs project for wikimetrics - https://phabricator.wikimedia.org/T122108#1896834 (10yuvipanda) [01:41:29] 6Labs, 10MediaWiki-Revision-deletion: Need to access revision histories of wikipedia pages - https://phabricator.wikimedia.org/T122035#1896835 (10yuvipanda) [01:41:29] andrewbogott: 07 [01:41:31] 6Labs, 7Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#1896833 (10yuvipanda) [01:41:50] I'm rebooting 08 now, since that locked up earlier yesterday and could be NFS (but maybe not) [01:42:05] !log tools rebooting tools-worker-08 [01:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [01:42:42] I guess I can just copy the live-running image since it’s just a cp and we aren’t planning to start up the copy ever... [01:43:05] yeah [01:43:56] YuviPanda: ok, copy grabbed [01:44:09] I’m a couple of hours late for dinner, so inclined to wait until tomorrow for forensics [01:44:26] andrewbogott: ok [01:44:37] I rebooted tools-worker-08 and it has some pretty crazy shit in the console [01:44:44] um, let me put a copy in /tmp so you can look [01:44:48] ok [01:45:09] /tmp/9729beea-cffa-4351-a246-1758e27fa666/ contains everything [01:45:14] and, let me find you a link for mounting... [01:45:48] these directions are not great, but will get you there: https://wikitech.wikimedia.org/wiki/OpenStack#Mounting_an_instance.27s_disk [01:46:19] ok [01:46:33] I'll look around at -08 and see if I can see something useful [01:46:36] I want to blame ldap, but can’t imagine how... [01:46:43] but the crazy shit in the console log is all NFS sounding [01:46:57] well, ok, yes, I also want to blame NFS. Either one :) [01:47:18] wtf all that is gone from log now [01:47:20] * YuviPanda boggles [01:47:22] ok [01:47:29] anyway, go dinner, andrewbogott :) [01:47:30] you’re talking about the log on wikitech? [01:47:33] yah [01:47:41] it’s erratic, I don’t know how it decides when to refresh [01:47:50] but it should all still be in syslog, right? 
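A minimal sketch of the copy-then-inspect approach discussed above, run as root on the virt host. The wikitech page linked above is the authoritative procedure; the nova path, the /mnt/forensics mount point and the device names here are assumptions, and the uuid is just the one mentioned in the conversation:

    # Copy a (still-running) instance's disk for later forensics, then mount
    # the copy read-only so nothing is ever written back to it.
    INSTANCE=9729beea-cffa-4351-a246-1758e27fa666     # uuid mentioned above
    SRC=/var/lib/nova/instances/$INSTANCE/disk        # usual nova layout; may differ here
    DST=/tmp/$INSTANCE/disk

    mkdir -p "/tmp/$INSTANCE" /mnt/forensics
    cp "$SRC" "$DST"            # plain cp is fine since the copy will never be booted

    modprobe nbd max_part=16
    qemu-nbd --read-only --connect=/dev/nbd0 "$DST"
    mount -o ro /dev/nbd0p1 /mnt/forensics
    less /mnt/forensics/var/log/syslog    # look for NFS / kernel complaints

    umount /mnt/forensics
    qemu-nbd --disconnect /dev/nbd0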
[01:48:00] I guess maybe console does not always equal syslog [01:48:26] but, ok, back later or tomorrow… text in case of disaster. [01:48:40] * andrewbogott waves [01:48:53] yeah [01:48:58] cya [02:11:14] YuviPanda: do you have a minute to make the wikimetrics project? [02:26:47] madhuvishy: yea let me do it [02:27:10] madhuvishy: done [02:37:09] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1896896 (10scfc) IMHO all maps should use the new maps server. Personally I'm more interested in the in-article-pop-up-thingy (i. e. the frame for exampl... [02:43:02] YuviPanda: thanks much [02:43:38] 6Labs: Create new labs project for wikimetrics - https://phabricator.wikimedia.org/T122108#1896900 (10madhuvishy) 5Open>3Resolved This is done! Thanks Yuvi. [02:43:40] 6Labs, 7Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#1896902 (10madhuvishy) [02:53:36] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1896916 (10Yurik) @scfc - I'm working on that as part of the [[ https://www.mediawiki.org/wiki/Extension:Kartographer | Kartographer extension ]]. See [[... [02:57:17] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1896920 (10Dzahn) merged the hiera change above. ran puppet on neon, it would still add it, used puppetstoredconfigclean.rb to remove store... [03:01:20] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1896923 (10Dzahn) @chasemp @andrew can i enable puppet on labtestcontrol2001 (even if just for a while) or does that mess with testing? [03:39:50] YuviPanda, ping [09:32:52] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1897210 (10cmarqu) There are services other than Wikis that are using some of these tiles (I can only speak for the hikebike and hillshading tiles), like... [09:54:17] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1897254 (10Kghbln) > I'm kinda surprised to see the toolserver replacement tiles.wmflabs.org go so soon. This implies that there were plans form the sta... [10:33:26] Hello. I have a question. Recently I joined a project and made a project myself, but on both I don't show up as maintainer... If I check at Special:NovaServiceGroup on wikitech it does show my name... [10:33:32] 6Labs, 10Labs-Infrastructure: Start pdns after opendj - https://phabricator.wikimedia.org/T65717#1897305 (10MoritzMuehlenhoff) Andrew, is that still the case with the new OpenLDAP servers? [10:34:02] atleast, on tools.wmflabs.org... [10:35:55] 6Labs, 10wikitech.wikimedia.org: Can't reset password on wikitech (Unicode passwords not accepted), due to LDAP/opendj? - https://phabricator.wikimedia.org/T58114#1897311 (10MoritzMuehlenhoff) @Nemo_bis: please let me know if anything needs to be fixed/changed on the OpenLDAP side of the new labs LDAP servers. 
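Back to the console-log question at the top of this hour (the wikitech console view refreshing and losing output): one way to snapshot it before a reboot is to pull it with the nova client. A hedged sketch, assuming admin credentials are sourced first; the env file name is a placeholder:

    # Save the instance's console output to a file so the "crazy shit in the
    # console" isn't lost when the view refreshes or the instance is rebooted.
    source ~/novaenv.sh     # placeholder for whatever exports the OS_* variables
    nova console-log tools-worker-08 > /tmp/tools-worker-08.console.$(date +%s).log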
[10:37:51] 6Labs, 10Tool-Labs, 10Labs-Team-Backlog: Set up A-based SPF for tools.wmflabs.org - https://phabricator.wikimedia.org/T104733#1897317 (10MoritzMuehlenhoff) >>! In T104733#1810829, @valhallasw wrote: > What's the ETA for the LDAP backend change? I'm also confused why this is > blocking changing the current re... [10:40:18] 6Labs, 10Labs-Infrastructure, 6operations, 7LDAP, 7discovery-system: Allow creation of SRV records in labs. - https://phabricator.wikimedia.org/T98009#1897323 (10MoritzMuehlenhoff) 5Open>3Resolved This is available in the new servers based on OpenLDAP. [11:54:24] 6Labs, 10Tool-Labs: Figure out a way to support java 1.8 on tool labs (Merl's bot) - https://phabricator.wikimedia.org/T121279#1897426 (10valhallasw) >>! In T121279#1874520, @Merl wrote: > Are there any plans to support ubuntu 16.04 as sge execution nodes next year? If so the easiest solution would be to keep... [11:57:57] 6Labs, 10Tool-Labs: queues on tools-exec-1402.eqiad.wmflabs and tools-exec-1203.eqiad.wmflabs are disabled - https://phabricator.wikimedia.org/T122125#1897429 (10valhallasw) 3NEW [12:03:52] 6Labs, 10Tool-Labs: Increase number of tools-webgrid-lighttpd-14xx nodes - https://phabricator.wikimedia.org/T117488#1897446 (10valhallasw) 5Open>3Resolved a:3valhallasw I think this is mostly resolved. We increased the swap space (T118419) and Yuvi added four more hosts. [12:04:11] 6Labs, 10Tool-Labs: Cannot login to tools-exec-1409 (host dead?) - https://phabricator.wikimedia.org/T121860#1897460 (10valhallasw) 5Open>3Resolved a:3valhallasw Andrew rebooted the box. [12:05:15] 6Labs, 10Tool-Labs: jsub and utf8 - https://phabricator.wikimedia.org/T60784#1897464 (10valhallasw) {T121505} is somewhat related. [12:07:16] chasemp: https://phabricator.wikimedia.org/project/profile/983/ -- is the 'don't use' comment still valid? The tag is being used currently already, and I'd like to sort tools mail related issues into #tool-labs + #mail [12:08:31] 6Labs, 10Tool-Labs, 5Patch-For-Review: webservicemonitor: AttributeError: 'module' object has no attribute 'utcnow' - https://phabricator.wikimedia.org/T115225#1897489 (10valhallasw) Is there any cleanup still to be done? [12:21:03] bah, it isn't helpful that the DBs provided by Tool Labs are latin1 and latin1_swedish_ci (?!) by default. YuviPanda, Coren [12:22:14] lol [12:22:19] abartov: Might want to file a task [12:23:23] * abartov grumbles [12:23:42] abartov: please file a task. It's possible to change that default, but a) it requires a configuration change on the database server, and b) it might break code that depends on that default [12:24:20] 6Labs, 10Tool-Labs: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1897627 (10Ijon) 3NEW [12:24:54] 6Labs, 10Tool-Labs: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1897643 (10jcrespo) Not sure if it should be utf8 or utf8mb4. [12:25:05] (I knew what the error meant and how to fix it, but I expect some tool developers wouldn't.) 
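Until the server default changes, a tool can simply ask for utf8mb4 explicitly when creating its database. A hedged sketch: the tools-db hostname, the credentials file and the s51234__example name are placeholders for your tool's own values:

    # Create the database with an explicit charset/collation instead of
    # inheriting the latin1/latin1_swedish_ci server default discussed above.
    mysql --defaults-file="$HOME/replica.my.cnf" -h tools-db -e "
      CREATE DATABASE s51234__example
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci;"

    # Existing tables can be converted too (check your data first):
    mysql --defaults-file="$HOME/replica.my.cnf" -h tools-db s51234__example -e "
      ALTER TABLE some_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"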
[12:25:06] 6Labs, 10Tool-Labs, 10DBA: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1897644 (10valhallasw) [12:25:13] ah, jynus is quick [12:25:32] 6Labs, 10Tool-Labs, 10DBA: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1897627 (10valhallasw) Edit conflict :( [12:26:45] YuviPanda, Coren, legoktm: btw, switched to dev.tools, but it still takes 18 minutes, compared to ~15-20 secs on my machine. What's up with that? [12:27:10] oh hang on, now it took just 1 minute. [12:29:37] 6Labs, 10Tool-Labs, 10DBA: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1897692 (10valhallasw) Probably utf8mb4, with either the utf8_unicode_ci or utf8mb4_bin collation. I'm not entirely sure what the impact of this is on databases... [12:30:32] 6Labs, 10Tool-Labs, 10DBA: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1897701 (10jcrespo) Replica databases have to stay in binary collation. I am talking here about toolsdb, which certainly is misconfigured. [12:48:17] jynus: does that mean the default collation for user databases will be different depending on the server? [12:48:38] yes [12:49:25] Hm. I'm not sure if that's necessarily a better situation in terms of user confusion [12:49:40] that is already happening [12:49:52] this only makes it better [13:22:30] 6Labs, 10Tool-Labs: Implement metrics for tool labs (under NDA?) - https://phabricator.wikimedia.org/T121233#1898134 (10valhallasw) After looking around for alternatives, http://goaccess.io/man seems to be a reasonable option. It allows for both console-mode and html output, and is available through apt. Mos... [13:24:01] abartov: I'm sure I could give more useful advice if I knew what "it" was. :-) build something? [13:30:25] 6Labs, 10Tool-Labs, 5Patch-For-Review: Implement metrics for tool labs (under NDA?) - https://phabricator.wikimedia.org/T121233#1898182 (10valhallasw) See https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Local_package_management for documentation. [13:39:28] Coren: "it" is running a (Rails) rake task. [13:39:36] in the authorlang-game tool [13:40:09] Is there a lot of disk I/O during? [13:45:15] some -- Rails will be loaded a whole bunch of modules. [13:45:36] but my actual code was failing on the first iteration, so no crunching post-app-bootup. [13:46:54] Hm. Disk I/O is the single most variable cause of lag in labs, but readonly access is agressively cached so it shouldn't be that painful. [13:47:12] (Writing to lots of small files is what kills, usually) [13:49:25] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1898257 (10Yurik) I don't think this was intentional--at least I haven't heard of these boxes being scheduled to go down. I would think it's a bug somewhe... [14:12:23] hello. I asked a question before here and it didn't got any answer, so i want to ask again... I recenly joined a project and made one myself... both don't have my name as maintainer (it does show it on the wiki). Is there something I'm doing wrong here? [14:12:39] on toollabs, to clarify [14:12:57] which projects? [14:13:07] wikilinkbot and my own, wiki13 [14:13:38] on tools.wmflabs.org it doesnt show them, whiley they do on wiki.. 
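Back to the toolsdb collation change a few lines up: whether it took effect can be checked from any client by reading the server variables directly (connection details as in the previous sketch, still placeholders):

    # Expect character_set_server / collation_server to report utf8-ish values
    # now, rather than latin1 / latin1_swedish_ci.
    mysql --defaults-file="$HOME/replica.my.cnf" -h tools-db -e "
      SHOW GLOBAL VARIABLES LIKE 'character_set_%';
      SHOW GLOBAL VARIABLES LIKE 'collation_%';"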
[14:14:28] Wiki13: The landing page is only updated at interval, but perhaps something is wrong with the update process. I'll check. [14:14:37] Wiki13: according to LDAP you're a member of both, so I'm not sure why tools.wmfalbs.org shows otherwise [14:15:08] valhallasw`cloud: I've now lost track of the mail project fiasco :) but I think that can be used and should be renamed if something more specific comes up, honestly mutante would remember best at this point ha [14:15:54] Wiki13: as for your tool -- there might be some confusion because the tools is a member of the proejct [14:16:18] but I don't think it should break [14:16:23] oh, should i remove it? [14:17:10] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1898297 (10chasemp) @dzahn sure man go ahead thanks [14:21:11] Coren: I was added yesterday for both, if that helps you... [14:27:43] 6Labs, 10Tool-Labs, 10DBA, 5Patch-For-Review: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1898323 (10jcrespo) ``` $ mysql Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 201372010 Server ve... [14:28:26] 6Labs, 10Tool-Labs, 10DBA, 5Patch-For-Review: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1898325 (10jcrespo) 5Open>3Resolved a:3jcrespo ``` MariaDB TOOLSDB master localhost (none) > SHOW GLOBAL VARIABLES like 'collat%'; +---... [14:29:33] ^abartov [14:31:17] Coren: thanks. I'm definitely not writing to many small files. There's a single log file, I suppose, but it's not spouting much to it. [14:32:12] abartov: So, definitely not disk-bound. Hm. Can you try your run again with 'time' so we know what part is system vs user? [14:32:23] 6Labs, 10Tool-Labs, 10DBA, 5Patch-For-Review: DBs provided on Tool Labs should have utf8 charset and collation by default - https://phabricator.wikimedia.org/T122148#1898329 (10jcrespo) Of course, not all databases will need UTF-8 (utf8mb4) character set- those are the user's/application responsibility to... [14:32:54] Fantastic, jynus, thanks! [14:34:15] the issue here was that wikimedia's standard is binary for maximum compatibility, so non-wiki databases got forgotten [14:34:42] * abartov nods [14:36:43] Wiki13: in any case, it should not have any impact on what you're doing -- logging in etc should work [14:36:58] Coren: hmm, this one was relatively short: [14:36:58] real 2m10.284s [14:36:58] user 0m2.832s [14:36:58] sys 0m7.092s [14:38:10] Huh. Interesting. Could it be network-bound? [14:38:21] that's what it appears to be. [14:38:57] valhallasw`cloud: nice to know, hopefully the info makes it there sometime soon. I already had a tool online, which also doesnt show up (but that may well be related to the issue i just reported) [14:39:03] My Ruby environment is entirely in my homedir (rvm), thus on NFS, so I'm assuming that slows things down considerably, versus a system ruby env. [14:39:37] abartov: It really shouldn't, unless you're reading multiple tens of megabytes of scripts. :-) [14:39:59] Coren: okay. so we agree it shouldn't. :) Now what? [14:40:32] a simple do-nothing (prints usage) command to test with is `rails g migration` [14:40:46] abartov: Moar profiling would be good, though there are a couple of plausible culprits. 
If you're connecting to things by name you might be hitting IPv6 IPs first, and those time out before falling back to v4 [14:41:21] Wait, is that do-nothing thing what takes 2m? [14:41:35] mm. this command I was testing should not have done any resolving, but my actual code certainly does (it queries Wikidata and Wikipedia's APIs) [14:41:40] Coren: yes. [14:42:10] Coren: can you, in your ineffable power, become(1) any tool you want? [14:42:12] Something is definitely amiss. Add me to your tool so I can test? [14:42:26] abartov: I could, not anymore. [14:42:29] Coren: how do I do that? :) [14:43:02] abartov: On wikitech, "Manage Service Groups" in the sidebar [14:43:17] You should be able to add me to it. [14:48:47] Coren: {{done}} [14:49:13] become authorlang-game ; cd authorlang_game ; time rails g migration [14:49:56] * Coren straces [14:52:08] ... at first glance, I might see a plausible issue. Ruby is completely insane in the way it reads modules. [14:52:33] I can see... 30 to 40 disk system calls per file it opens. [14:52:53] hmm [14:53:01] but you said reads are aggressively cached...? [14:53:04] Why in Baal's name would it feel the need to repeatedly lstat() the same directories thousands of times? [14:53:35] Yeah, not the stat() call family because that explicitly fetches fresh info. Metadata vs data. [14:53:51] That's still completely batshit insane behaviour. [14:54:45] Coren: I tend to agree. Though I've never looked at ruby internals, so I don't know how it marshals modules etc. [14:56:31] Coren: bisy, backson. [14:56:42] If you're curious, /tmp/alg.trace has the dirt on it. So far as I can see, it tries dozens of places, doing stat() of every path component in the way, every time [14:59:11] Just that 'null' invocation that does nothing but print usage does 7846 stat() syscalls and 5401 open()s - most completely redundantly. I see it do consecutive fstat() of the same file several times. [14:59:16] Hm. [14:59:44] So we know why. Lemme thing to find a workaround for that braindead behaviour. [15:01:37] (Even at a very generous 10ms per metadata fetch, those idiotic stat and redundant opens are going to consume over 130s of time) [15:04:26] abartov: Googling a bit shows other people running into problems with (it seems to be) the implementation of 'require' [15:04:46] abartov: https://bugs.ruby-lang.org/issues/7158 [15:04:56] ^^ this suggests a patch to ruby itself to fix it. [15:05:23] He says "These patches speed up a large Rails application's startup by 2.2x, [15:05:23] and a pure-'require' benchmark by 3.4x." [15:05:34] But I'm thinking you might get 10x to 30x because NFS [15:08:55] "It not only triples amount of stat/open calls, it is like 17.5x faster. I am filling new bugreport for this." [15:08:58] rofl [15:10:11] it's funny to see how all that 'invent a different wheel' stuff tends to run up to the same limitations that made the old wheels complicated :) [15:11:04] thedj: admittedly, unix locale handling makes more things painful ;-) [15:11:11] what, people find that grease/lubcrication is needed after all? ;) [15:12:46] classic mistake of "all of these edge cases are ruining the code!" cue a rewrite and 2 years of relearning edge cases [15:14:46] chasemp: It's a symptom of the early 2000 philosophy of "screw efficiency, computers are getting faster/bigger/etc". Bit coders when resource-constrained smartphones hit, but not that many people learned the lesson. [15:15:09] Though I don't expect ruby is a common platform on phones. 
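A hedged sketch of the profiling being described here, runnable from a tools bastion. The tool and directory names are the ones from this conversation, and the syscall list is just the metadata-heavy calls being counted:

    become authorlang-game
    cd authorlang_game

    # Trace only metadata/open syscalls of a do-nothing rails invocation.
    strace -f -e trace=stat,lstat,fstat,open,openat -o /tmp/alg.trace \
        rails g migration

    # Rough totals: thousands of calls just to print usage points at require()
    # re-walking $LOAD_PATH (mostly the NFS-hosted rvm tree) over and over.
    grep -cE '(l?stat|fstat|openat?)\(' /tmp/alg.trace
    # Which paths are hit most often?
    grep -oE '"[^"]*"' /tmp/alg.trace | sort | uniq -c | sort -rn | head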
:-) [15:16:23] and it's not that it's actually a bad process to do. It's just that people often expect the wrong things from it. [15:17:12] abartov: As far as you practical, immediate issue is concerned however... best thing I can think of offhand is to constrain $LOAD_PATH and $LEADED_FEATURES as much as practical. You might get order-of-magnitude improvement from that alone as it will greatly reduce the number of stat() calls. [15:17:37] (Especially $LOAD_PATH I'm thinking) [15:17:46] wait, would changing my LC_* envvars help? [15:18:37] abartov: LANG=C could make it faster, yes, because it needs to load less files [15:18:52] the rest of the LC_* inherit from LANG, so that's the only one you have to set [15:19:01] abartov: I don't think it will - the locale.rb but refered there is either fixed or doesn't hit you hard; the strace show very few stats of locale-related stuff [15:19:17] abartov: It might, but I don't think it's going to be major. [15:19:28] k, thanks. [15:19:44] (Worth trying if you don't actually need the locale support, anyways) [15:20:11] thanks, everyone. [15:32:01] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1898390 (10MaxSem) I attempted to log into the main instance for this, but it just timed out. Are any maintainers able to log in? NB: strictly speaking,... [15:33:19] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1898391 (10Dzahn) 5Open>3Resolved after re-enabling puppet on labtestcontrol2001 and running it on neon, it now works as intended. the... [15:34:12] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga, 7Monitoring: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1898393 (10Dzahn) [15:53:25] (03CR) 10Ricordisamoa: "I don't know much about SPARQL, but Wikidata Query is going to be phased out: https://lists.wikimedia.org/pipermail/wikidata/2015-Septembe" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [16:37:20] (03CR) 10Ricordisamoa: Added a Wikidata-based "chart of the nuclides" under /nuclides (033 comments) [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [17:10:12] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1898559 (10akosiaris) I 've rebooted the instance that serves those tiles, seems like it suffered a partial freeze due to an NFS outage earlier this week... [17:26:43] 6Labs, 6Discovery, 10Maps: Replacements for a.toolserver.org, b.toolserver.org, c.toolserver.org not available - https://phabricator.wikimedia.org/T103272#1898620 (10cmarqu) Thanks very much! That seems to also have triggered the process that is re-rendering the tiles when requesting so (i.e. with /dirty at... [17:30:54] chasemp: andrewbogott ^ I wonder if the instances being talked about here also are the same type of freeze tools-worker-07 is in [17:31:02] worker-08 had nothing useful in logs when unfrozen [17:31:55] when was that one? [17:32:03] which? [17:32:18] 08 [17:32:22] -08? it froze sunday, rebooted yesterday night. 
-07 froze yesterday [17:32:25] kk [17:32:35] and -07 had been rebooted since the NFS outage [17:32:37] and was working [17:32:37] just wondering if any today, that was a better question :) [17:32:48] I just woke up 15min ago, so none yet :D [17:33:09] I haven't announced PAWS widely yet because the nodes keep dying every day :( [17:33:24] graphite shows no anomalies either [17:35:35] YuviPanda: ARGH. Ssh to -07 was still alive, but I was stupid enough to do a ls [17:35:49] which is now hanging [17:36:37] valhallasw`cloud: augh, that's strange [17:36:56] valhallasw`cloud: was the ls on NFS? [17:37:01] yes, $HOME [17:37:07] seriously all signs point one way here [17:37:11] repeatedly hit ctrl-c? :D [17:37:21] gives me pretty ^C^C^C^C :-p [17:37:43] HULK SMASH CTRL-C? :) [17:38:14] so let me first cd / for all other hosts, and next time immediately strace sshd instead of doing anything else [17:38:21] ok [17:38:36] valhallasw`cloud: I've added -04, 05 and -08 back btw [17:39:15] ok. I have to ifgure out how to get agent forwarding to work in screen after re-attaching [17:43:14] chasemp: andrewbogott heh, this feels like one of those things that's gonna blow up on me ths week [17:44:49] * YuviPanda gets out of bed for real [17:44:51] brb [17:47:25] * Coren has no NFS issues atm [17:51:05] valhallasw`cloud: before I go, I wanted to point you to http://graphite.wmflabs.org/render/?width=588&height=310&_salt=1450806647.729&target=tools.tools-proxy-02.reqstats.line_rate [17:51:12] valhallasw`cloud: that's total requests, and uses logster [17:51:24] we can extend it to do per-tool counts easily as well, just needs a simple logster module regex [17:51:34] Coren: yeah, I am not either, except instances are strangely locking up. [17:51:41] YuviPanda: can we add alerts to that? [17:51:44] valhallasw`cloud: yeah [17:51:56] YuviPanda: Sure, but the locked up instances include many with no NFS whatsoever [17:52:32] YuviPanda: I don't think NFS is significant other than symptomatic of "kernel land dies even if [parts of] userland remains" [17:52:33] Coren: nope, all of the ones taht've locked up are on NFS [17:52:47] yeah, I don't feel like it's NFS solely [17:53:10] and it just seems like 'convenient thing to wag finger at' :) [17:53:56] YuviPanda: so something is off there, because there should be huge peaks from last week [17:54:37] mmm, wait. The issue was not so much the number of requests, but all of them hitting tools.admin [17:54:55] IY rgiyfgr rgR RGW J7A QIEJWEA SUSB:R SI BDA> [17:55:03] o_O [17:55:16] I thought that the k8s workers didn't do nfs? [17:55:17] valhallasw`cloud: last week it was tools-proxy-01 [17:55:24] YuviPanda: I know [17:55:45] YuviPanda: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1450806939.915&target=tools.tools-proxy-01.reqstats.line_rate&target=tools.tools-proxy-02.reqstats.line_rate&areaMode=stacked&from=-1months [17:56:15] And tools-proxy-* do not either afaik. [17:56:26] I guess line rate might not be the best measure [17:56:34] Coren: it did when it locked up. I removed NFS from it yesterday. [17:56:37] YuviPanda: ^^ tools-proxy-02 speaking of the devil... [17:56:39] Coren: the workers are on NFS right now [17:56:56] -02 seems dead [17:57:03] I can ssh in [17:57:15] but [17:57:17] 3 root 20 0 0 0 0 R 5.7 0.0 293:04.41 ksoftirqd/0 [17:57:19] mm [17:57:27] 13 root 20 0 0 0 0 S 2.7 0.0 143:48.82 ksoftirqd/1 [17:58:05] That's... broken. 
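For reference, the smell being eyeballed here is easy to check per host; a sketch, with the worker host range as an assumption:

    # Minutes of ksoftirqd/rcu_sched CPU time on a mostly idle guest is the
    # red flag discussed above; a healthy box accumulates seconds.
    ps -eo comm,time | grep -E '^(ksoftirqd|rcu_sched)'
    uptime

    # Same check across the k8s workers (the name range is a guess):
    for h in tools-worker-{01..10}; do
      echo "== $h"
      ssh "$h" "ps -eo comm,time | grep -E '^(ksoftirqd|rcu_sched)'"
    done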
[17:58:13] Points to the virtualization layer [17:58:22] I've seen it before in other k8s nodes just before they died [17:58:32] well, for some definition of 'just' [18:00:00] https://phabricator.wikimedia.org/P2450 [18:00:04] /proc/interrupts [18:00:18] 25: 85614770 0 PCI-MSI-edge virtio0-input.0 [18:00:20] hmm [18:00:35] YuviPanda: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1450807229.258&areaMode=stacked&from=-14days&target=tools.tools-proxy-01.cpu.total.user&target=tools.tools-proxy-02.cpu.total.user&vtitle=CPU%20(user) [18:00:35] * YuviPanda looks at proxy-01 [18:00:46] proxy-01 cpu usage blew up as well [18:01:07] 13 root 20 0 0 0 0 S 5.0 0.0 4:47.91 ksoftirqd/1 [18:01:10] on tools-proxy-01 [18:01:13] which is under *no* load [18:01:25] andrewbogott: chasemp ^ these instances have no NFS either [18:02:32] Nearly 5 minutes of CPU time in less than a day idling? For the ksoftirqd thread? That's mad. [18:05:00] YuviPanda: For comparison, I have a fairly used server that does mail, web, and a dozen other services and never idles completely that gathered about 16m of cpu time for those trhreads in 206 days of uptime. [18:05:14] right [18:05:17] so this seems mad [18:05:32] all of the k8s nodes are also jessie [18:05:44] So not release-related. [18:06:26] no, the proxies are also jessie [18:06:36] Hm. [18:06:44] But the -exec- nodes aren't. [18:07:16] no, and tools-bastion-01 also has 3 root 20 0 0 0 0 S 0.3 0.0 4:37.56 [ksoftirqd/0] in 7 days of uptime [18:07:52] YuviPanda: https://bugs.openvz.org/browse/OVZ-5660 [18:08:18] Interestingly related. Mentions KVM has having the same issue [18:08:55] Maybe not; that specific bug seems to have been fixed in 3.0 kernels [18:09:51] YuviPanda: Aha. http://www.paranoids.at/debian-jessie-as-kvm-guest-high-cpu-load/ [18:10:13] "Simple solution was to install backports kernel 4.2.0-0.bpo.1-amd64 oder compile fresh vanilla kernel via make localyesconfig. [18:10:13] [18:10:13] Seems to be a debian kernel bug." [18:10:26] the k8s hosts are already on 4.2.0-0.bpo.1-amd64 [18:10:30] the proxies aren't [18:10:36] and the virt host? [18:10:41] virt host is [18:10:45] 3.16.0-45-generic [18:10:47] and trusty [18:11:05] proxies are 3.19.0-1-amd64 [18:11:22] I can upgrade the proxies to 4.2 [18:11:32] That blog post doesn't say if the bug was in the host or the guests though [18:12:13] after reading the post five times: I think they are using a Jessie VM under an ubuntu 14 / arch host [18:12:18] yeah [18:12:20] so it's about the kernel of the guest [18:12:41] !log tools upgraded kernel and rebooted tools-proxy-01 [18:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:12:52] let's see when it comes back up [18:13:53] this still doesn't count for the k8s nodes [18:13:57] which are already on 4.2 [18:14:18] Yeah, smells like an unrelated (if real) issue. [18:14:23] * andrewbogott is back from lunch but has nothing to contribute [18:14:54] doing a failover [18:17:00] !log tools failed over active proxy to proxy-01 [18:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:17:29] going to give it 20min for puppet runs everywhere then reboot [18:17:44] and in that interval also try to get out of this warm and cozy and lazy blanket... [18:17:50] brb [18:18:15] btw, EVERYONE we’re about to replace the ssl cert on wikitech. So there may be hiccups and/or interrupted sessions [18:22:26] YuviPanda: andrewbogott: aware of the tools-home issue? 
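A sketch of the jessie-guest workaround referenced in that blog post, as applied to the proxies above. Run as root; assumes jessie-backports is already configured in the guest's apt sources:

    uname -r                                    # e.g. 3.19.0-1-amd64 before
    apt-get update
    apt-get install -t jessie-backports linux-image-amd64
    reboot
    # after the reboot: confirm, then keep an eye on the kernel threads
    uname -r                                    # should report 4.2.0-0.bpo.1-amd64
    ps -eo comm,time | grep '^ksoftirqd'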
[18:22:31] 502 Bad Gateway [18:22:34] um [18:23:20] looking [18:23:42] > 2015/12/22 18:23:28 [error] 566#566: *80498 connect() failed (111: Connection refused) while connecting to upstream, client: 68.180.230.167, server: , request: "GET /geohack/mr/43_23_N_1_05_W_type:city?pagename=Masparraute HTTP/1.1", upstream: "http://10.68.20.248:44301/admin/?502", host: "tools.wmflabs.org" [18:23:46] dammit [18:23:50] sorry, random IP [18:25:25] mutante: paravoid andrewbogott ok, it's just admin that's down [18:25:39] no [18:25:43] it's NFS that's down? [18:25:49] idk, proxy's up and nagf works so proxy is fine [18:26:28] all of them are 'connection refused' [18:27:31] ok, it's all connecting to wrong ports?! [18:27:34] where is 10.68.20.248 ? [18:28:08] mutante: are you asking re: the tools issue? [18:28:20] YuviPanda: the tools.admin webserver is running, tools-webgrid-lighttpd-1405 port 51970 [18:28:31] so ports are wrong [18:28:38] restarting tools admin fixed it for tools admin [18:28:50] let me restart all webservices. [18:28:55] andrewbogott: yes, that IP is in the paste from yuvi [18:28:57] mutante: [18:28:59] the host that refuses [18:28:59] | 21ba7af6-bde1-4f08-a16a-c3faa06b6ccc | tools-webgrid-lighttpd-1415 | tools | ACTIVE | - | Running | public=10.68.20.248 | [18:29:05] YuviPanda: You just failed over - is it possible the redises were not properly synced [18:29:09] ? [18:29:23] seems like yuvi got it already though [18:30:16] yeah that's what I Think happened Coren [18:30:41] * Coren wonders if there's a way to alert on that. [18:30:41] or maybe not, since I failed-over back to -02 and it still was producing errors [18:30:52] !log tools rescheduling all webservices [18:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:32:12] I think -01 had a stale copy, and the flipping made it the master, and -01 also got a stale copy, and boom? [18:32:22] redis is just the wrong tool for this job :( [18:34:10] everything's back [18:34:18] Coren: valhallasw`cloud can you check a few tools? [18:34:21] * YuviPanda does too [18:34:28] * Coren looks. [18:34:32] YuviPanda: is there a canonical representation outside of redis? How did you recover? [18:34:36] watroles is back [18:34:55] andrewbogott: it's pretty much stateless. I just restarted all webservices and they all got new ports and updated that info in redis [18:35:00] wsexport is happy, so is geohack [18:35:13] YuviPanda: ah, this is the tools proxy [18:35:23] I’m always confusing the two [18:35:37] qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qmod -rj [18:35:43] YuviPanda: I can find no broken services atm [18:35:43] does it [18:35:45] ok! [18:35:48] tools-home was down [18:36:04] and the fix was what now? [18:36:04] andrewbogott: yeah, and for labs proxy there's an sqlite db [18:36:43] chasemp: I was failing over proxies figuring out high softirq load on them, and got hit with stale redis replication [18:36:52] gotcha [18:38:33] I'll write up a report later. [18:38:36] hmm [18:38:39] no [18:38:48] redis is still a bit out of sync [18:39:09] so two proxies 01 and 02, do they hit respective different redis instances then? [18:39:12] one of which is the slave? [18:39:40] yeah, so -01 was slave and -02 was master and I flipped them over. 
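A hedged sketch of a sanity check that would catch the stale-replica situation before flipping, plus the blunt recovery actually used above. The qstat one-liner is quoted from this log; the rest assumes redis listens on localhost on both proxies, as stated above:

    # Compare replication role/health and keyspace size on the two proxies.
    for h in tools-proxy-01 tools-proxy-02; do
      echo "== $h"
      ssh "$h" "redis-cli info replication | grep -E '^(role|master_link_status|master_repl_offset|slave_repl_offset)'"
      ssh "$h" "redis-cli dbsize"
    done

    # Recovery used above: restart every grid webservice so each one registers
    # its new port with the (now authoritative) proxy redis.
    qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qmod -rj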
[18:39:53] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:03] hmm [18:40:27] so -02 is slave of -01 right now [18:40:36] which is fine and how it should be [18:41:13] ok [18:41:32] afa the proxies tho, does proxy-01 hit redis-01 and proxy-02 hit redis-02? [18:41:40] chasemp: they have redises on localhost [18:43:31] Re landing page timeout; it's up but sluggish. [18:44:53] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 965308 bytes in 9.750 second response time [18:49:57] everything sane? [18:50:26] * YuviPanda is reminded of the alt-text for https://xkcd.com/371/ [18:50:55] Everything looks kosher, if a little slower than usual. I don't think that's abnormal given the number of restarts in a brief period. [18:51:23] that tools home check is really just a canary for a million issues [18:52:05] yeah [18:52:48] could be: NFS, proxy, DNS, network, LDAP, bad code, magnus-decides-to-remove-all-his-tools-or-change-his-name, etc [18:53:07] is that in order of most likely to least :D [18:53:14] fawikisource is killing us with requests [18:53:25] that too [18:53:28] for weather checks [18:53:30] ofc [19:06:38] valhallasw`cloud: Coren andrewbogott chasemp I need to reboot proxy-02 to finish the kernel upgrade, and almost no traffic is on it, but I'll wait an hour or so anyway. [19:06:51] 4th time I'm saying 'am going to go afk now', maybe this time it'll be true! [19:06:52] k [19:06:59] YuviPanda: Noted. [19:08:31] Coren: btw, 3 root 20 0 0 0 0 S 0.0 0.0 0:02.38 ksoftirqd/0 [19:08:34] with 55min uptime [19:08:37] not sure if good or bad. [19:08:47] need to give it a day I suppose [19:08:47] 2 secs? That seems reasonable. [19:08:58] 2s from 55min? [19:09:04] Oh, right, 55m [19:09:06] Hm. [19:09:18] But the boot itself is atypical. I'd give it a day. [19:09:26] yeah [19:09:28] ok [19:09:36] there's also [19:09:38] 7 root 20 0 0 0 0 R 0.3 0.0 0:08.24 rcu_sched [19:11:08] Hm. rcu_sched tends to be a bit higher. My home box accumulates at least 1s/h of CPU there [19:11:49] hmm, so this is 8s per almost 1h [19:11:53] we'll give it a day I guess [19:12:00] ok, 5th time, 'i am going afk now' [19:15:50] 6Labs, 6WMF-Legal: Discussion: can I park WikiSpy under a separate, simpler domain? - https://phabricator.wikimedia.org/T97846#1898876 (10d33tah) @yuvipanda: I'd like one IP then, please. :) As for SSL, letsencrypt will definitely be enough - I already tested that. [19:31:17] 6Labs, 10netops, 6operations, 5Patch-For-Review: Create labs baremetal subnet? - https://phabricator.wikimedia.org/T121237#1898893 (10RobH) There seems to be no actual #patch-for-review associated with this task? (I don't see it linked or a quick grep of gerrit for bug:T121237 shows nothing. I'm going to... [19:31:23] 6Labs, 10netops, 6operations: Create labs baremetal subnet? 
- https://phabricator.wikimedia.org/T121237#1898894 (10RobH) [20:34:36] YuviPanda: I'm wondering if we cannot easily do SNI virtualhosts on tools [20:34:49] maybe manually through puppet, but that might actually be preferrable [21:13:00] 6Labs, 6operations, 10ops-eqiad, 5Patch-For-Review: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1899247 (10Andrew) 5Open>3Resolved All patches are merged; process is documented at https://wikitech.wikimedia.org/wiki/Labs_Baremetal_Lifecycle [21:13:02] 6Labs, 5Patch-For-Review: PoC bare-metal server allocation in labs -- bootstrap mode - https://phabricator.wikimedia.org/T95185#1899249 (10Andrew) [21:13:13] 6Labs, 5Patch-For-Review: PoC bare-metal server allocation in labs -- bootstrap mode - https://phabricator.wikimedia.org/T95185#1899250 (10Andrew) 5Open>3Resolved a:3Andrew https://wikitech.wikimedia.org/wiki/Labs_Baremetal_Lifecycle [21:27:41] this is not a real problem but i'm curious why when you delete an instance in labs it says - Deleted instance wikimetrics-staging-01 ($2). Do you charge 2 dollars to delete one :D [21:29:47] madhuvishy: we should! [21:29:53] :D [21:30:12] Mostly I think it's a placeholder for a parameter (wikimetrics-staging-01 would be $1, and maybe the instance ID would be $2?) [21:30:23] I never actually noticed it before :) [21:30:31] yeah but some string substitution is failing? [21:30:54] basically [21:32:47] (wikibugs spam coming in) [21:32:54] 6Labs, 10Tool-Labs, 7Mail: correctly envelope forwarded email - https://phabricator.wikimedia.org/T120225#1899315 (10valhallasw) [21:32:57] 6Labs, 10Tool-Labs, 7Mail: Weird e-mails from tool labs - https://phabricator.wikimedia.org/T120210#1899316 (10valhallasw) [21:32:59] 6Labs, 10Tool-Labs, 10Labs-Team-Backlog, 7Mail: Set up A-based SPF for tools.wmflabs.org - https://phabricator.wikimedia.org/T104733#1899317 (10valhallasw) [21:33:01] 6Labs, 10Tool-Labs, 7Mail, 3ToolLabs-Goals-Q4: Make tools-mail redundant - https://phabricator.wikimedia.org/T96967#1899319 (10valhallasw) [21:33:04] 6Labs, 10Tool-Labs, 7Mail, 5Patch-For-Review: Set up shinken for tools-mail exim paniclog - https://phabricator.wikimedia.org/T96898#1899320 (10valhallasw) [21:33:05] 6Labs, 10Tool-Labs, 7Mail: Monitor mail system in Graphite - https://phabricator.wikimedia.org/T71072#1899323 (10valhallasw) [21:33:07] 6Labs, 10Tool-Labs, 7Mail: Errors in e-mail pipes should go to local error log, not e-mail sender - https://phabricator.wikimedia.org/T72003#1899322 (10valhallasw) [21:33:09] 6Labs, 10Tool-Labs, 7Mail: Move tools-mail to trusty - https://phabricator.wikimedia.org/T96299#1899321 (10valhallasw) [21:33:14] 6Labs, 10Tool-Labs, 7Mail, 3ToolLabs-Goals-Q4: Set up alerts for mail queue - https://phabricator.wikimedia.org/T60871#1899332 (10valhallasw) [21:33:16] 6Labs, 10Tool-Labs, 7Mail: Harden mail server against incoming spam - https://phabricator.wikimedia.org/T67629#1899331 (10valhallasw) [21:33:46] (there were 11 changes, but one didn't show up? oh well...) [21:58:16] YuviPanda: about?
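On the "($2)" curiosity above: it reads like an interface message with two positional parameters where only the first is being filled in. A toy illustration of the symptom, not the actual OpenStackManager code:

    # The template expects a name ($1) and an id ($2); called with only the
    # name, the second placeholder survives into the message verbatim.
    deleted_msg() {
      local name=$1 id=$2
      echo "Deleted instance ${name} (${id:-\$2})"
    }
    deleted_msg wikimetrics-staging-01                # -> Deleted instance wikimetrics-staging-01 ($2)
    deleted_msg wikimetrics-staging-01 i-000004d2     # -> Deleted instance wikimetrics-staging-01 (i-000004d2)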