[00:00:57] hey, is nfs down?
[00:03:13] andrewbogott: it looks like nfs is stuck
[00:03:19] or very slow
[00:04:30] "become" takes several minutes
[00:04:41] anybody here?
[00:08:05] yuvipanda: ^
[00:09:45] Opsen are aware and are looking into it
[00:10:19] Reedy: Thanks :)
[00:11:35] The load on the bastion is unreasonably high - there seems to be a runaway python job.
[00:11:57] Ima try to lighten the load; but in the meantime dev.tools.wmflabs.org (the secondary bastion) works fine.
[00:13:58] Looks to me like NFS is fine but the bastion is not
[00:14:20] andrewbogott: Heh. Way ahead of you. There's a user hammering on it atm.
[00:14:48] I'm afk for a minute here so glad to hear it is localized :)
[00:14:52] Coren: do you have the rights you need or do you need me to smash? (if I ever get a shell there, that is…)
[00:15:04] andrewbogott: Coren smash!
[00:15:14] great!
[00:15:38] Load should be stabilizing soon.
[00:15:56] iiuc it is bad manners to run a job that way on bastion
[00:16:21] I think someone was doing something compression-heavy on dumps. Ima fire off a talk page message.
[00:16:35] They don't seem to be on IRC
[00:16:56] ... and they restarted it.
[00:17:56] still stuck, and it sucks
[00:20:31] doctaxon: I've killed the offending processes that were eating all resources. The instance will thrash a bit more but should settle down.
[00:21:03] Coren: I hope there'll be no more restarts
[00:21:23] but thx a lot
[00:24:57] marmick: hello!
[00:25:02] andrewbogott: yup
[00:25:08] Something you were just now running on the bastion was causing trouble
[00:25:19] So we (that is, Coren) had to kill your process
[00:25:22] i was merging files
[00:25:28] is there a reason you can’t dispatch the job to the grid?
[00:25:41] huge files. cat x y > z
[00:25:43] Ah, there he is. :-) Sorry about the rough way I had to kill your jobs.
[00:25:55] yeh i felt i was in the jungle
[00:26:06] :)
[00:26:13] marmick: The bigger issue was a python program running.
[00:26:40] i see, one which takes about 20 min
[00:27:00] i'll send it to the grid then
[00:27:00] The problem isn't so much how long it runs as how many resources it consumes.
[00:27:10] So, yeah - thank you.
[00:27:27] Feel free to remove the .profile with the loud message I just added in your home. :-)
[00:27:31] so, Coren, I can do the other 'cat' with no problem, right?
[00:28:11] marmick: what size files are we talking about?
[00:28:13] marmick: That still consumes a lot of resources, especially if you cat to and from the same (NFS) filesystem.
[00:28:27] marmick: Using /tmp would help a lot.
[00:28:30] 5 GB + 1 GB... about 6,5
[00:28:44] i want to merge files and then delete the parts
[00:28:56] should i send that into the grid as well with a .sh?
[00:29:48] marmick: Sending off to the grid would be better, but doing operations on large files like this would be even better off NFS or at least on the scratch filesystem.
[00:29:50] Coren: you mean, moving (mv) my files to /tmp and then cat'ing the new one to my home?
[00:30:19] Ideally, cat *to* /tmp, then fiddle with the file there, then only move it back when you are done.
[00:30:43] fiddle...
[00:30:46] Wait.
[00:30:51] You want to:
[00:30:58] cat x y >z
[00:31:01] then rm x and y?
[00:31:14] exactly
[00:31:22] Okay, do this:
[00:31:26] mv y /tmp/y
[00:31:32] cat /tmp/y >>x
[00:31:40] then rm /tmp/y
[00:31:48] marmick - pay my off time!
[00:32:02] doctaxon: ¿?
[00:32:12] This way you won't read and write x multiple times. :-)
[00:32:12] couldn't do anything
[00:32:38] ok Coren thanks :)
[01:01:07] Krinkle: I'm interested in using Intuition for a Tool Labs project. The documentation mentions that it is centrally installed. Does that mean I should use a different path for the require_once, rather than "__DIR__ . '/vendor/autoload.php';" (and not install it locally)?
[01:02:11] kaldari: Using the central path is discouraged. For new tools do not use /data/project/intution/src/Intuition/InitTool.php.
[01:02:25] kaldari: Yes, you should add it to composer.json in your own repo and commit the i18n there as well
[01:02:34] and add it to translatewiki config in Gerrit
[01:02:46] or, initially without twn
[01:03:01] ok
[01:03:05] Just need intuition in your tool via composer and an i18n directory containing .json files
[01:03:23] got it
[01:03:27] the documentation is updated for this use case, so you would indeed be using your own vendor/autoload
[01:03:35] https://github.com/Krinkle/intuition/wiki/Migrate is also good
[01:03:44] even though you're technically not migrating, it might help
[01:12:17] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2058404 (Nuria) Yurik, this is awesome
[01:23:17] Coren: bastion is slow again
[01:43:33] Krinkle: The documentation mentions that you can override the current language with parameter overrides, but doesn't explain how
[01:44:15] kaldari: should be on https://github.com/Krinkle/intuition/wiki/Documentation
[01:44:24] kaldari: What are you looking for? The query parameter is ?userlang=
[01:44:30] and it also remembers it in a cookie
[01:45:03] Even if you don't use the central install, you can still use https://tools.wmflabs.org/intuition/ as the preferences panel (and it is the default)
[01:45:14] so that the cookie is shared with other tools and the user likely has their language set already
[01:45:22] there is a ?returnto redirect option
[01:45:28] Used by the footer link
[01:45:36] userlang is what I was looking for, thanks!
[01:45:54] should have read further
[01:46:02] kaldari: You can change the query parameter name with the 'param' option to the Intuition constructor
[01:46:07] if you want to
[01:46:10] cool
[01:46:30] In most cases though, it'll be the default language or from the cookie
[01:46:41] it also uses the Accept-Language header as a fallback when no cookie is present
[02:35:08] Labs, Labs-Infrastructure: labvirt1006 disk space alert - https://phabricator.wikimedia.org/T127840#2058624 (Hydriz) @Andrew So yep, pick a window that is most convenient for you and we will pick up from there :)
[03:20:05] https://test2.wikipedia.org/wiki/Help:Lua_development_environment suggests a non-existent TestUtils.lua to provide 'toframe'. the redlink seems to be because this content was cloned from http://scribunto.wmflabs.org/ which is now defunct. Google's not finding it either. Can anyone help me to get this so I have a semi-decent environment to test some mediawiki lua code?
[03:20:27] more worrisome, does this suggest that Lua never caught on in mediawiki, and I should be afraid of using it?
[03:20:36] no, lua is huge on mediawiki now
[03:20:53] comforting to hear
[03:21:16] (also, to whoever is going to try to retrieve that, i think it was saved at Module:TestUtils)
[03:22:23] as the page says: "advanced Lua scripting in MediaWiki is still a tedious process". I need to do some of this dev, and am just looking for a reasonably sane (and perhaps popular) dev environment. typically I work in emacs at a unix prompt, and avoid gui ides.
[03:27:47] PROBLEM - SSH on tools-webgrid-lighttpd-1206 is CRITICAL: Server answer
[03:28:52] jackmcbarn I don't get the sense anyone is going to try to retrieve that. can you suggest who/where to make a specific request?
[03:30:33] in here is the best place i know of to ask for help with that, and personally, i find the textarea that you edit pages with to be fine for writing lua code
[03:30:55] that page you're reading is rather old (note that it's on test2.wikipedia.org and not on a real wiki), and scribunto has improved a lot since then
[03:37:27] http://snpedia.com/index.php/Special:Version runs a recent scribunto, but I need to work out some nested loops, and working in a browser, let alone a textarea, is far from ideal. If lua is now popular on wikipedia I have to imagine some nicer envs are in use, but I don't see anything about local dev/test on https://www.mediawiki.org/wiki/Extension:Scribunto
[03:38:19] there are some other extensions that Wikipedia runs to make it better, mainly CodeEditor
[03:54:03] I'll give it a try
[03:54:22] and reask here during usa awake hours
[03:54:33] thanks for the pointers
[03:54:49] Labs, Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#2058678 (Beetstra) @valhallasw - I had to move the bot to another instance, it is now on 1205 (if I become linkwatcher I can't ssh to 1209, access denied).
[04:08:01] Labs, Tool-Labs, Patch-For-Review, User-bd808: Make error pages mobile friendly - https://phabricator.wikimedia.org/T119830#2058693 (Dispenser) Resolved>Open `/style.css` still has a `body { min-width:600px; }` and a two column layout at 320px widths, so not responsive yet.
[04:17:15] (CR) Ricordisamoa: "Hmm... I don't see get_magic_numbers() used anywhere?" [labs/tools/ptable] - https://gerrit.wikimedia.org/r/271551 (owner: ArthurPSmith)
[04:27:48] RECOVERY - SSH on tools-webgrid-lighttpd-1206 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[04:45:12] Labs, Tool-Labs, Patch-For-Review, User-bd808: Make error pages mobile friendly - https://phabricator.wikimedia.org/T119830#2058721 (bd808) >>! In T119830#2058693, @Dispenser wrote: > `/style.css` still has a `body { min-width:600px; }` and a two column layout at 320px widths, so not responsive ye...
[05:52:32] jackmcbarn if you're still here, or anyone else capable. I'd be grateful for a bit of guidance on using the lua editor effectively
[05:53:31] http://52.17.3.10/index.php?title=Rs112039851&action=edit shows the calling page
[05:54:01] the intermediate template http://52.17.3.10/index.php?title=Template:ClinVar&action=edit just passes it on
[05:54:19] to the lua module http://52.17.3.10/index.php?title=Module:ClinVar&action=edit
[05:55:18] I understand how to use the debug console with =p to see the definition of the table, and =p.ClinVar to see the function
[05:56:16] but not how to pass in variables in a way that's consistent with the behavior as seen from the page->template->lua
[05:56:26] ie, I'd like to be able to say
[05:57:34] =p.ClinVar(ALT=A,C|CAF=0.9726; 0.02736)
[05:58:06] but am unclear on how to represent the params to the function call
[06:57:57] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[07:33:05] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:34:51] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228)
[08:45:46] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2058863 (Tobi_WMDE_SW) @Yurik This looks great!
[09:14:59] Labs, Tool-Labs: Webservice job stuck in dr state (bsaut on Tool Labs) - https://phabricator.wikimedia.org/T127933#2058914 (Danmichaelo)
[09:27:06] Labs, Labs-Infrastructure: labvirt1006 disk space alert - https://phabricator.wikimedia.org/T127840#2058948 (Nemo_bis) I think the migration already happened: the instance was rebooted a few seconds after my last comment here.
[09:52:17] Labs, Labs-Infrastructure: tools-webgrid-lighttpd-1206.eqiad.wmflabs hanging - https://phabricator.wikimedia.org/T127936#2059021 (valhallasw)
[09:52:36] Labs, Tool-Labs: Webservice job stuck in dr state (bsaut on Tool Labs) - https://phabricator.wikimedia.org/T127933#2059035 (valhallasw) Open>Resolved Force-deleted the job -- see {T127936} for more info.
[11:04:52] Labs, Tool-Labs, Patch-For-Review: setup-tomcat does not work - https://phabricator.wikimedia.org/T118094#2059154 (zhuyifei1999) Apparently at least tools-webgrid-generic-1401 has `tomcat7-instance-create`: ``` 11:02:37 0 ✓ zhuyifei1999@tools-webgrid-generic-1401: ~$ which tomcat7-instance-create /us...
[11:10:35] Labs, Labs-Infrastructure: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059161 (Magnus)
[11:11:53] Labs, Labs-Infrastructure: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059176 (Magnus) p:Triage>High
[11:12:46] (PS2) Zhuyifei1999: setup-tomcat: Add to install list & Change queue [labs/toollabs] - https://gerrit.wikimedia.org/r/272699 (https://phabricator.wikimedia.org/T118094)
[11:53:58] good morning eastern europe. Would/can someone from wikimedia-labs provide a copy of the now offline http://scribunto.wmflabs.org/index.php/Module:TestUtils? It is mentioned at https://test2.wikipedia.org/wiki/Help:Lua_development_environment as able to provide a 'toframe' method which would be very helpful for developing some lua code from the command line, instead of having to do all of the editing through a textarea.
[11:54:15] dang, I meant western europe
[12:01:27] Labs, Labs-Infrastructure: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059161 (jcrespo) "catscan2 seems to go down every few hours" is too vague. Can you provide details about what fails? Do user requests fail? If they do, what queries were they doing? I do not hav...
[12:16:57] Labs, Tool-Labs, DBA: Tool Labs queries die - https://phabricator.wikimedia.org/T127266#2059268 (jcrespo) I have my own scripts to control wild queries, however, not every user uses that syntax (which has never been documented in our servers) and I was strongly advised not to impose per-user limi...
[12:34:22] Labs, Tool-Labs, DBA: Tool Labs queries die - https://phabricator.wikimedia.org/T127266#2059315 (jcrespo) Let me propose something: is this specific issue solved or do you still have issues? After that, we can create a new ticket/mail thread to discuss a new tools policy that everybody agrees with.
[12:49:39] Labs, Labs-Infrastructure: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059362 (Magnus) Well, I don't have many more details, other than what is already on T127066. * server-status at https://tools.wmflabs.org/catscan2/server-status * server-stats at https://tools...
[12:54:06] Labs, Labs-Infrastructure: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059390 (Magnus) So the DB error is our all-time favourite: There was an error running the query [MySQL server has gone away] So I'm now adding a new DB connection before EVERY SINGLE QUERY, a...
[13:08:12] Labs, Labs-Infrastructure: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059424 (jcrespo) Labsdb1002 server went down at midnight UTC, the night from Sun, Feb 14 to Mon, Feb 15. If you have some time, chatting on IRC would be more interactive, but it is ok if you c...
[13:42:39] Labs, Labs-Infrastructure, Patch-For-Review: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059561 (jcrespo) So, regarding other questions: * idle connections are closed after 5 minutes on replicas. This is to avoid reserving resources (done on connection) if th...
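The "MySQL server has gone away" errors quoted above, combined with replicas closing idle connections after 5 minutes, suggest reconnecting only when a query actually fails instead of opening a new connection before every query. A minimal Python sketch of that idea follows. It is not taken from catscan2 or from any real driver: `FakeConnection`, `ConnectionLost` and `ReconnectingClient` are hypothetical names invented for illustration, and the fake clock stands in for real idle time.

```python
class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver error such as 'MySQL server has gone away'."""


class FakeConnection:
    """Hypothetical connection that the server drops after idle_limit seconds of inactivity."""

    def __init__(self, idle_limit, clock):
        self.idle_limit = idle_limit
        self.clock = clock
        self.last_used = clock()

    def query(self, sql):
        now = self.clock()
        if now - self.last_used > self.idle_limit:
            raise ConnectionLost("server has gone away")
        self.last_used = now
        return "result of " + sql


class ReconnectingClient:
    """Connects lazily and retries a query once on a fresh connection if the old one was lost."""

    def __init__(self, connect):
        self._connect = connect  # zero-argument factory returning a new connection
        self._conn = None
        self.reconnects = 0

    def query(self, sql):
        if self._conn is None:
            self._conn = self._connect()
        try:
            return self._conn.query(sql)
        except ConnectionLost:
            # The idle connection was closed by the server: reconnect and retry once.
            self._conn = self._connect()
            self.reconnects += 1
            return self._conn.query(sql)


# Simulated clock so the 5-minute (300 s) idle timeout can be shown without waiting.
fake_now = [0.0]
client = ReconnectingClient(lambda: FakeConnection(300, lambda: fake_now[0]))
first = client.query("SELECT 1")
fake_now[0] += 301  # pretend the tool sat idle for just over five minutes
second = client.query("SELECT 1")
```

With a real driver the `except` clause would have to catch that driver's own connection-lost error, and a retry of this kind is only safe for read-only queries such as the replica lookups discussed here.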
[13:47:18] Labs, Labs-Infrastructure, Patch-For-Review: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059595 (Magnus) catscan2 seems to hold its own for the moment; maybe the proper log/fail and DB reconnect does help. So could the commons switchover, thanks for that. Jus...
[13:50:18] Labs, Labs-Infrastructure: tools-webgrid-lighttpd-1206.eqiad.wmflabs hanging - https://phabricator.wikimedia.org/T127936#2059599 (chasemp) p:Triage>High
[13:52:18] Labs, Labs-Infrastructure, Patch-For-Review: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059603 (jcrespo) No more unknown column on heartbeat - it was a non-issue due to the s2 production master failover. Aside from finding sometimes lag created by user processe...
[13:53:37] Labs, Labs-Infrastructure, Patch-For-Review: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059604 (jcrespo) c1: ``` MariaDB LABS localhost heartbeat_p > SELECT * FROM heartbeat; +-------+----------------------------+------+ | shard | last_updated...
[13:54:18] Labs, Labs-Infrastructure: I/O on labmon1001 is very slow - https://phabricator.wikimedia.org/T127957#2059611 (MoritzMuehlenhoff)
[13:59:00] Labs, Labs-Infrastructure, Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#2059628 (chasemp)
[13:59:02] Labs, Labs-Infrastructure: tools-webgrid-lighttpd-1206.eqiad.wmflabs hanging - https://phabricator.wikimedia.org/T127936#2059626 (chasemp) Open>Resolved node was locked up, I pulled this from the console: {F3412117} so yeah, more of the "can't resolve" the NFS server errors. I never was able to...
[14:09:59] Labs, Labs-Infrastructure, Patch-For-Review: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059666 (jcrespo) Magnus: I know labsdatabases are not in the best moment right now, I am trying to get resources from where there is very little. I am working on getting...
[14:13:41] (CR) ArthurPSmith: "ah, that's something that I added for the next round of changes - highlighting the magic numbers was requested here: https://www.wikidata." [labs/tools/ptable] - https://gerrit.wikimedia.org/r/271551 (owner: ArthurPSmith)
[14:21:33] Steinsplitter, I am killing some of your queries on commons, they are getting piled up (6 queries, all the same)
[14:22:28] as you can see on the backlog here, there were some complaints about replica performance, you are probably not the main source, but it is contributing to it
[14:28:19] thanks jynus
[14:30:35] do you know how to apply https://gerrit.wikimedia.org/r/#/c/272965/ ?
[14:30:55] or is it not taking effect because tools have to be restarted?
[14:31:53] jynus: puppet run on labservices1001
[14:32:18] well, I suppose that runs regularly?
[14:32:23] jynus: it does!
[14:32:30] so no change
[14:32:34] so, yeah, ‘wait a few minutes’ is a valid approach
[14:32:51] jynus: ok, I don’t know what that’s about. I’ll have a look soon
[14:33:13] I cannot log in to labs, can you check that logging in to commonswiki_p really redirects to labsdb1001?
[14:33:28] I mean tools labs here
[14:33:56] valhallasw`cloud, yuvipanda, we’re considering moving some of the Opsy talk out of this channel and into #wikimedia-labs-admin. Feel free to join us there.
[14:34:03] jynus: would you like to be able to log in to labs?
[14:34:11] ‘cause that’ll take like 2 minutes
[14:37:52] andrewbogott, to tools, yes
[14:38:05] jynus, what’s your username on wikitech?
[14:38:30] and, do you have a keypair configured for labs already?
[14:39:08] if not, drop your public key here: https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack
[14:39:32] Labs, Tool-Labs, Patch-For-Review, User-bd808: Make error pages mobile friendly - https://phabricator.wikimedia.org/T119830#2059769 (Dispenser) ```/* Small screen linearize */ @media only screen and (max-width:600px){ body {min-width:inherit;} .col2 { text-align:center;border-top:solid; } .c...
[14:49:00] Labs, Labs-Infrastructure, Patch-For-Review: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2059815 (Magnus) First, thanks for trying to keep the engine running despite tight resources! :-) catscan2 is now running for as long as it was in my initial post here, bu...
[15:02:29] Labs, Labs-Infrastructure: labvirt1006 disk space alert - https://phabricator.wikimedia.org/T127840#2059901 (Andrew) Open>Resolved Yep, done and all is well. Thanks for your help, all!
[15:04:01] Labs: New project for job candidate tests - https://phabricator.wikimedia.org/T127970#2059909 (jcrespo)
[15:07:44] Labs, Patch-For-Review: Periodic internal labs dns outages - https://phabricator.wikimedia.org/T124680#2059953 (Andrew) This just happened again. So I guess this is happening... more often, now that the config is fixed?
[15:14:51] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[15:14:53] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[15:15:59] PROBLEM - Puppet failure on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[15:16:16] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[15:17:02] PROBLEM - Puppet failure on tools-elastic-01 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[15:17:32] PROBLEM - Puppet failure on tools-puppetmaster-01 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0]
[15:18:18] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:18:24] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0]
[15:19:58] PROBLEM - Puppet failure on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[15:20:23] ok, all, I /believe/ that there was a few-minute dns outage and that things are now resolved and/or resolving. If anyone is still seeing failures please tell me!
[15:20:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[15:20:44] (not you, shinken)
[15:21:01] yeah my ssh abilities came back (i.e. bastion can resolve now)
[15:21:17] this is really getting to be troublesome
[15:21:19] yeah
[15:21:29] not to mention I’ve already fixed almost all the possible causes
[15:21:41] but I guess if it happens more it makes it easier for me to know when I fix the right thing :)
[15:46:33] Labs: New project for job candidate tests - https://phabricator.wikimedia.org/T127970#2059909 (chasemp) @jcrespo. So here is what I've got so far. A project named `ops-db-candidates` where I've spun up a VM called `db-candidate-trials-1`. Labs accounts are approved automatically IIRC so $candidate should...
[15:48:19] Labs, Labs-Infrastructure, Patch-For-Review: Tools still shaky; DB replicas to blame? - https://phabricator.wikimedia.org/T127940#2060057 (jcrespo) > This might be a case for the "Community Tech team"? Maybe. It sounds similar or related to what they told me they wanted to do when I asked for Tools h...
[15:51:11] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:52:21] andrewbogott, chasemp: do either of you know how to kill orphans in k8s?
[15:52:28] RECOVERY - Puppet failure on tools-puppetmaster-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:52:31] ..honestly I don't
[15:52:38] There are 2 copies of grrrit running but k8s only seems to know about one of them
[15:52:43] valhallasw is probably most likely to but I think he is busy atm
[15:52:48] bd808: really no idea :(
[15:52:59] my k8s mojo is only cut-n-paste from https://wikitech.wikimedia.org/wiki/Grrrit-wm
[15:53:05] bd808: we've had this before -- I'm also unsure how this happens
[15:53:26] RECOVERY - Puppet failure on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:53:34] and there's no easy way to figure out where it's running because NAT
[15:54:37] I guess I can kill the one that k8s knows about and then either the other will keep running or it will die too and I can start it all up again
[15:54:45] yeah, that's what I did last time
[15:54:50] the other one died after a few hours
[15:54:53] :/
[15:54:58] RECOVERY - Puppet failure on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:54:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:54:59] k. I'll give it a shot
[15:55:12] let us know how it turns out! :)
[15:55:56] RECOVERY - Puppet failure on tools-webgrid-generic-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:39] !log tools.lolrrit-wm 2 bots running & only 1 known by k8s. Killing it
[15:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[16:00:47] Labs: New project for job candidate tests - https://phabricator.wikimedia.org/T127970#2060120 (chasemp) A few more details as discussed on IRC. If you can create your puppet role to setup the test then we make it available to instances in this project in OSM and it should be pretty gtg. Let me know and I'll...
[16:01:25] jynus: which queries? There was no problem for years. They are older than your wikimedia job :o
[16:01:38] valhallasw`cloud: The one known pod is still in terminating state but its bot seems to be down
[16:01:40] so if and when grrrit-wm dies somebody will have to start it manually -- https://wikitech.wikimedia.org/wiki/Grrrit-wm
[16:02:43] Steinsplitter, we have fewer resources than ever, and probably more load
[16:03:09] there was some commons contention
[16:03:26] jynus: which queries? likely cron must be disabled as well.
[16:06:11] jynus: can you tell me which queries?
[16:06:22] SELECT DISTINCT gil_wiki,gil_page_title,gil_page_namespace,gil_to FROM commonswiki_p.globalimagelinks WHERE gil_to IN (SELECT DISTINCT img_name FROM image WHERE img_user_text="XXXX")
[16:06:38] I was searching for them, chill
[16:07:08] RECOVERY - Puppet failure on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:08:01] now I am not sure if you are s51203
[16:09:11] sec
[16:10:04] changing venue, back in 20
[16:10:07] jynus: /me is s51916
[16:10:34] jynus: 51203 is tools.glamtools (I tempfixed https://tools.wmflabs.org/contact/ )
[16:10:54] ah, glamtools is magnus
[16:10:55] the phab api I was abusing no longer exists :-)
[16:11:06] ah, so same user that complained :-)
[16:11:17] :-D
[16:11:42] sorry, too many numbers and I got them mixed
[16:12:10] following up with him
[16:12:43] the same happened the other day with Yuvi's account number
[16:32:00] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Nodepool, Patch-For-Review: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2060251 (hashar) I have tried again to snapshot a running instance. On labnodepool1001.eqiad.wmnet as nodepo...
[16:48:37] !log tools.stashbot Bot down; restarting
[16:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master
[16:51:38] !log tools.stashbot Bot restarted; missing SAL data since 2016-02-21T14:39
[16:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master
[17:01:13] hashar: want to give your nodepool exercise another try?
[17:03:34] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Nodepool, Patch-For-Review: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2060481 (Andrew) That patch is scattershot, but I changed the defaults so if there were secret policies prev...
[17:03:56] andrewbogott: good morning
[17:04:11] * andrewbogott waves, dozes off
[17:04:27] I am retrying
[17:04:46] sorry for the very long replies on the task; I wasn't sure how much information to share so I have been rather inclusive
[17:05:31] recreating
[17:07:58] | status | active |
[17:08:03] andrewbogott: magician you are
[17:08:13] so i guess it was lacking some kind of permissions somehow
[17:08:28] No idea which permission, I wasn’t very scientific about this
[17:08:44] but I don’t mind how things are set now, so we can probably just declare victory unless you’re extremely curious
[17:10:34] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Nodepool, Patch-For-Review: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2060491 (hashar) Gave it a try and it works fine now: | status | active And `$ openstack image...
[17:10:45] andrewbogott: na no need to waste time on that imho
[17:10:52] ok, great :)
[17:10:57] andrewbogott: though having a nicer policy that does not require glanceadmin would be great
[17:11:09] i.e. allow it if role::admin AND tenant::contintcloud
[17:11:18] but I have no idea if that can be expressed in an openstack policy rule
[17:11:34] It probably can be, but I think I prefer having it in a role rather than being user-specific
[17:12:08] but you still had to mark nodepoolmanager as a glanceadmin, haven't you?
[17:13:33] yes, but… that seems correct to me
[17:13:42] ok ok :-}
[17:13:52] The ‘normal’ openstack model is to allow all users to create/install/delete custom images
[17:13:58] I don't want you to have to inject a hack that will end up being some kind of tech debt later on
[17:14:17] nodepool is recreating an image. We will soon know whether it all works fine
[17:14:27] yeah, that makes sense. In this case I think it’s the right solution. And it’s well within the metaphor of keystone so I don’t think it’ll be hard to maintain.
[17:14:41] all good so
[17:15:11] so today I have learned that the 'openstack' CLI tools can dump a stack trace (--debug) and accept different levels of verbosity (-v -vv -vvv ...)
[17:15:24] which happens to eventually dump the low level curl requests and response headers
[17:15:26] quite useful
[17:16:50] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Nodepool, Patch-For-Review: Nodepool can't refresh snapshot on labs since ~ Feb 15th - https://phabricator.wikimedia.org/T127755#2060509 (hashar) Open>Resolved ``` $ nodepool image-update wmflabs-eqiad ci-jessie-wikimedia 2016-02-2...
[17:17:00] andrewbogott: imho it is all good / success kudos!
[17:17:14] hashar: yeah, the old nova, glance, keystone client commands are deprecated. Someday I need to rewrite all of the docs :(
[17:19:12] andrewbogott: the unified cli tool is pretty much straightforward at least
[17:19:19] yeah
[17:19:20] nice subcommands tree
[17:19:23] help everywhere
[17:19:27] debug / verbosity etc
[17:19:39] Labs, Tool-Labs: tools.taxonbot and tools.giftbot cronjobs not firing - https://phabricator.wikimedia.org/T123186#2060531 (Giftpflanze) Open>Resolved Resolved for both of us.
[17:19:42] Labs, Tool-Labs: tools-bastion-05 is super slow - https://phabricator.wikimedia.org/T127992#2060533 (MZMcBride)
[17:21:01] andrewbogott: how can i get attention for https://phabricator.wikimedia.org/T127494?
[17:22:10] gifti: unfortunately I’m not sure we have a policy yet for designating a tool as officially abandoned. Are you unable to contact any of the former owners?
[17:23:02] i will contact drtrigon, this will take approximately one or two weeks until he responds
[17:23:31] ok — clearly the easiest/least controversial path is to get an existing admin to add you.
[17:23:41] yeah, i know [17:23:52] just takes a bit [17:25:25] maybe tools labs can use a takeover policy of some sort ;D [17:25:48] we definitely need a policy [17:26:50] there's an rfc even … [17:27:40] 6Labs, 6WMF-Legal: Craft a policy for seizing abandoned tools and projects - https://phabricator.wikimedia.org/T127994#2060590 (10Andrew) [17:27:52] gifti: really? link? [17:28:06] * andrewbogott is not fully engaged today [17:33:02] um, let me look for it … [17:35:28] andrewbogott: https://phabricator.wikimedia.org/T87730 and https://meta.wikimedia.org/wiki/Requests_for_comment/Abandoned_Labs_tools (but it is dormant) [17:35:37] thank you! [17:36:06] 6Labs, 6WMF-Legal: Craft a policy for seizing abandoned tools and projects - https://phabricator.wikimedia.org/T127994#2060642 (10Andrew) [17:36:09] 6Labs, 10Tool-Labs, 6Developer-Relations, 6WMF-Legal: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#2060643 (10Andrew) [17:46:27] 6Labs, 10Tool-Labs: Add new maintainer to tools.drtrigonbot and tools.asurabot - https://phabricator.wikimedia.org/T127494#2060670 (10Giftpflanze) Because there's no policy in place yet, per @Andrew, I tried and sent an e-mail to DrTrigon, who hopefully will respond. [18:01:58] 6Labs, 10Tool-Labs: tools-bastion-05 is super slow - https://phabricator.wikimedia.org/T127992#2060710 (10Luke081515) p:5Triage>3High [18:03:00] 6Labs, 10Tool-Labs: Labs users should be able to force-delete their own jobs - https://phabricator.wikimedia.org/T127681#2050398 (10Luke081515) I'm not sure, but I can use -f as non admin at my tool. Is this really disabled? [18:03:27] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [3600.0] [18:11:35] 6Labs, 10Tool-Labs: tools-bastion-05 is super slow - https://phabricator.wikimedia.org/T127992#2060748 (10Andrew) It's responsive now, this seems to be coming and going. 
When the bastion seizes up it is typically the result of someone running a super-expensive job that eats up all the resources. Maybe -05 is... [18:30:44] PROBLEM - Free space - all mounts on tools-worker-1002 is CRITICAL: CRITICAL: tools.tools-worker-1002.diskspace.root.byte_percentfree (<10.00%) [18:32:19] Labs: Need to reset two-factor authentication for wikitech account - https://phabricator.wikimedia.org/T127999#2060794 (JAufrecht) [18:44:47] Labs, Tool-Labs: tools-bastion-05 is super slow - https://phabricator.wikimedia.org/T127992#2060840 (MZMcBride) Yeah, it seems better now. [18:46:13] Labs, MediaWiki-extensions-OpenStackManager: Additions and removals of project members are very hard to decipher in the wiki diff - https://phabricator.wikimedia.org/T128001#2060846 (scfc) [18:47:49] Why is tools-bastion-01 down? [18:47:55] luke081515@bastion-01:~$ ssh tools-bastion-01 [18:47:55] ssh: connect to host tools-bastion-01 port 22: No route to host [18:48:42] Luke081515, use tools-login.wmflabs.org [18:48:54] I just wondered, the others are reachable [18:49:02] it is now pointing to number 5, not sure about "the long story" [18:49:04] normally I'm using that ;) [18:49:09] ah, ok [18:49:22] I am a labs user now :-) [18:49:32] seems like bastion-05 has issues too https://phabricator.wikimedia.org/T127992 :-/ [18:49:35] :) [18:49:46] but I don't have mysql access :-( [18:50:19] as WMF Database admin?
:D [18:50:30] as a labs user [18:50:42] I mean in general ;) [18:51:35] sometimes some permissions just sound funny, for example andre has nearly full admin access to phabricator, but is not able to edit repositories [18:51:38] (at phab) [18:51:47] of course I do, as an admin I have [18:51:53] ah, ok :) [18:52:00] yes, same [18:52:04] Otherwise this sounds very strange :D [18:52:31] I probably have the option to overpass labs admins, but that doesn't mean I should [18:52:49] it's as if I edited the wiki directly from the database [18:52:56] that is a big no [18:54:14] both ethically and technically [18:54:41] yeah, agreed [18:55:16] but sometimes this is needed, actually I did this a few days ago, because CA had a bug, and without database changes I could not create global groups [18:56:42] Labs, MediaWiki-extensions-OpenStackManager: Lists of users in Labs project pages should be sorted by wiki user name - https://phabricator.wikimedia.org/T128002#2060881 (scfc) [19:00:29] Labs, Tool-Labs: tools-bastion-05 is super slow - https://phabricator.wikimedia.org/T127992#2060909 (Multichill) I noticed @marcmiquel running big jobs on the bastion the other day that ran on 95%+ CPU for hours. He should probably read the grid engine manual ( https://wikitech.wikimedia.org/wiki/Help:Too... [19:04:43] Labs, Tool-Labs: tools-bastion-05 is super slow - https://phabricator.wikimedia.org/T127992#2060950 (marcmiquel) My apologies, I ran a script on the bastion when I should have sent it to the grid instead. They pointed this out to me in the IRC channel and now it is clear. [19:04:53] Luke081515: because bastion-01 was taken down (see andrew's mail from last week) [19:05:01] ok [19:19:18] did something just happen to bastion? stuff is really slow [19:20:15] like, NFS-type slowness [19:20:22] 05?
[19:20:30] yeah [19:21:30] my guess is someone running some large job where they shouldn't [19:21:34] at the moment it's frozen up [19:21:39] ugh [19:22:53] marcmiqu pts/26 62.83.214.27.dyn 19:02 7:51 15.76s 15.59s mv enwiki_3229084.tsv /tmp/ [19:22:53] marcmiqu pts/31 62.83.214.27.dyn 19:05 2:12 2.10s 1.90s mv enwiki_anon_az.tsv /tmp/ [19:23:59] mv over filesystem? ;( [19:25:57] Earwig: I'll try to clean up here but in the meantime I think bastion-02 is ok [19:26:03] thanks [19:37:59] PROBLEM - Puppet failure on tools-elastic-03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [19:46:06] PROBLEM - Free space - all mounts on tools-bastion-05 is CRITICAL: CRITICAL: tools.tools-bastion-05.diskspace.root.byte_percentfree (<16.67%) [19:46:30] !log tools runonce deployed for https://gerrit.wikimedia.org/r/#/c/272891/ [19:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:50:49] Labs, Patch-For-Review: Periodic internal labs dns outages - https://phabricator.wikimedia.org/T124680#2061206 (scfc) It would still be interesting to know //what// happens at 15:00Z that triggers these outages. The fixed time doesn't sound like the DNS server goes belly up after x hours running or y req...
[19:56:13] RECOVERY - Free space - all mounts on tools-bastion-05 is OK: OK: All targets OK [20:03:34] (Abandoned) Youni Verciti: Initial Check-in & Html [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/271763 (owner: Youni Verciti) [20:12:57] RECOVERY - Puppet failure on tools-elastic-03 is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [20:42:30] (PS1) Youni Verciti: Initial check-in [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/273060 [20:44:16] (Abandoned) Youni Verciti: Initial check-in [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/273060 (owner: Youni Verciti) [20:57:44] RECOVERY - Free space - all mounts on tools-worker-1002 is OK: OK: All targets OK [20:59:45] Labs, Continuous-Integration-Infrastructure, Operations, puppet-compiler, Puppet: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#2061479 (hashar) [21:00:34] Labs, Tool-Labs, puppet-compiler: toolsbeta: set up puppet-compiler / temporary-apply - https://phabricator.wikimedia.org/T97081#2061482 (hashar) [21:03:58] (PS1) Youni Verciti: Add the public_html folder to easily update the html code [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/273096 [21:07:32] hashar: Description and projects of https://phabricator.wikimedia.org/diffusion/OSPC/ ok, or should I change them for you? [21:09:08] Luke081515: hello! I have replied on the task asking for #puppet-compiler project [21:09:34] Luke081515: I got confused when you talked about sub-projects because at that time I had never heard of such a feature in Phabricator ;-D [21:09:46] hashar: Hi!
Yeah, but you didn't mention the repo :D [21:09:51] as for the description of OSPC I am not even a maintainer of that repo [21:10:06] ok [21:10:13] it has the link to the project so that looks fine to me [21:10:48] the main reason was to tag cards / lookup issues easily [21:11:15] hashar: Yeah, I'm adding projects to repos from time to time.... only about 700 repos with the word "extension" left.... I didn't count the other unflagged repos :-/ [21:17:44] Luke081515: that is a daunting task :( [21:17:55] maybe that can be scripted [21:18:02] or mass done directly in the db [21:18:31] the problem is: How can this script detect the right projects? [21:18:45] not every repo has a project [21:19:28] you can probably get most extensions done [21:19:35] #MediaWiki-extensions-{name} [21:19:49] I remember auto-filing bugs that way and only had to do like 30 out of 200 manually [21:24:52] Hi, [21:24:57] legoktm: But there are a lot of repos without a project: https://phabricator.wikimedia.org/diffusion/query/Coh9RZPrbkiD/#R [21:25:20] bd808: You got an answer, sadly that didn't work as expected [21:26:19] did bastion just go down ? [21:26:37] i was kicked out and can log back in [21:26:41] matanya: normal bastion or tools bastion? [21:26:48] tools [21:26:53] tools-login.wmflabs.org [21:27:09] I have committed changes from my station in gerrit but the tool vocabulary-index account's files are not updated. [21:27:15] connecting is slow [21:27:25] but you can use a workaround [21:27:43] matanya: Go to normal bastion, and ssh to tools-bastion-02 [21:27:45] that works [21:27:50] ok, thanks [21:27:55] I will file a bug [21:28:08] or connect to tools-dev.wmflabs.org to connect to -02 directly [21:28:22] Is it a good idea to fetch from the tool to gerrit in this case? [21:28:35] valhallasw`cloud: I think bastion-05 is the problem, connection times out [21:28:48] valhallasw`cloud: Should I file a bug too? [21:28:57] Yes, please.
[21:29:03] ok [21:29:15] thanks, that worked [21:30:47] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061655 (Luke081515) [21:30:53] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061669 (Luke081515) p:Triage>Unbreak! [21:30:59] valhallasw`cloud: ^ [21:34:43] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061677 (Luke081515) For affected users: Temporary workaround: Connect to tools-dev.wmflabs.org to use the actually working bastion-02 instead of -05. [21:35:14] I added your information, valhallasw`cloud, I think this is useful for users who are affected [21:36:50] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061655 (doctaxon) but it works very fine to me (tools.taxonbot) [21:37:43] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061693 (Luke081515) WFM now too, seems like this was just a temporary issue. Needs investigation or close this? [21:37:59] maybe close it [21:38:23] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061655 (Multichill) I'm logged in and I could open another session without a problem....
[21:39:06] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061697 (doctaxon) ya, no problems, too [21:47:33] (PS1) Matanya: initial commit [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/273116 [21:48:01] (CR) Matanya: [C: 2 V: 2] initial commit [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/273116 (owner: Matanya) [21:53:16] Labs, Labs-Infrastructure, Tool-Labs: tools-bastion-05 is unreachable - https://phabricator.wikimedia.org/T128026#2061714 (Luke081515) p:Unbreak!>Triage [21:54:50] RECOVERY - Puppet failure on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [22:08:26] PROBLEM - Puppet failure on tools-bastion-mmtemp is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [22:12:17] (Abandoned) Youni Verciti: Add the public_html folder to easily update the html code [labs/tools/vocabulary-index] - https://gerrit.wikimedia.org/r/273096 (owner: Youni Verciti) [22:15:16] Is it possible to change the OAuth callback URL once an OAuth consumer has been created? [22:18:18] tom29739: not at the moment [22:18:29] PROBLEM - Host tools-bastion-mmtemp is DOWN: CRITICAL - Host Unreachable (10.68.17.61) [22:18:31] OK. I can work around it. [22:18:54] you can just make a new consumer, same name, larger version number [22:19:40] tom29739: if you mean if it's possible to have a dynamic URL, then yes [22:19:52] but the consumer has to be registered that way [22:20:35] I checked the option for dynamic URL. how do you use it?
[22:23:07] (PS1) Matanya: add SULwatcher [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/273122 [22:23:36] (CR) Matanya: [C: 2 V: 2] add SULwatcher [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/273122 (owner: Matanya) [22:25:22] hi guys, [22:25:22] i'm sending a script to the grids with the following code: [22:25:23] "sort enwiki_1_final.tsv | uniq > enwiki_1_final_new.tsv" [22:25:24] i want to delete 10GB of repeated rows from a 40GB file. [22:25:26] but the process dies because of space: [22:25:28] "sort: write failed: /tmp/3787973.1.task/sortYfkuGG: No space left on device" [22:25:30] what should i do? it seems that tmp is the working memory place for sort,… but tmp is full. [22:25:48] yuvipanda, chasemp, andrewbogott? [22:25:58] /tmp is small [22:26:34] It's limited to some number of GB. I forget how much. Something like 2GB I think [22:26:56] Labs, wikitech.wikimedia.org: SRF preference messages broken - https://phabricator.wikimedia.org/T128027#2061756 (Krenair) [22:27:01] tom29739: it's more, but still too little to run 'sort' on this 40GB file [22:28:05] I think there's an argument for sort. Something like --buffer-size. That would solve your problem [22:29:30] so you think reducing the buffer could avoid the space problem? [22:30:02] but still we are talking about sorting this 40GB of rows [22:30:24] i don't know exactly how the algorithm works, but it might need space [22:31:13] Ah, https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html [22:31:21] '--buffer-size=size' [22:33:14] i'm not sure if it works for larger files [22:33:23] "this option affects only the initial buffer size. The buffer grows beyond size if sort encounters input lines larger than size."
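The "No space left on device" failure quoted above is consistent with how an external merge sort works: input that does not fit in memory is sorted in chunks, each sorted chunk is written to a temporary directory, and the chunks are merged at the end, so the temporary directory can need about as much free space as the input. A minimal Python sketch of that idea follows. It is not GNU sort's actual implementation, and `sorted_unique_lines`, `tmp_dir` and `chunk_size` are hypothetical names chosen for illustration.

```python
import heapq
import itertools
import os
import tempfile


def sorted_unique_lines(lines, tmp_dir, chunk_size=100_000):
    """Yield the distinct lines of `lines` in sorted order.

    Illustrative external merge sort: at most `chunk_size` lines are held
    in memory at once; sorted chunks are spilled to files in `tmp_dir`.
    """
    # Make sure every line ends with a newline so comparisons are consistent.
    normalized = (ln if ln.endswith("\n") else ln + "\n" for ln in lines)
    chunk_paths = []
    try:
        # Phase 1: sort (and de-duplicate) one chunk at a time, spill to disk.
        while True:
            chunk = sorted(set(itertools.islice(normalized, chunk_size)))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".chunk")
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.writelines(chunk)
            chunk_paths.append(path)

        # Phase 2: k-way merge of the sorted chunks, dropping repeats.
        # Note: this opens one file per chunk, so a very large input needs
        # a large chunk_size (or a multi-pass merge) to stay under fd limits.
        handles = [open(p, encoding="utf-8") for p in chunk_paths]
        try:
            previous = None
            for line in heapq.merge(*handles):
                if line != previous:
                    yield line
                    previous = line
        finally:
            for handle in handles:
                handle.close()
    finally:
        # Remove the spilled chunks whether or not the merge completed.
        for path in chunk_paths:
            os.remove(path)
```

With GNU sort itself, the temporary directory is chosen with `-T` (`--temporary-directory`), and `sort -u` combines the sort and uniq steps into one command.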
[22:34:09] It was being talked about the other day [22:34:15] I'll check the logs [22:35:53] ok [22:36:55] tom29739: set the callback parameter [22:37:17] it's part of the oauth standard so if you use a library it will probably have an option for it [22:38:33] marmick, are you doing it on the bastion or on sge? [22:38:46] tom29739: i'm sending it to the grids [22:38:50] is sge the grids? [22:39:00] Yeah [22:39:13] on bastion i would kill the cpu and annoy other users, which i try not to [22:39:34] marmick: reading 40GB from NFS and then writing 30GB to NFS isn't going to win you any friends [22:39:53] where should i do it then? [22:39:56] marmick, the grid [22:40:00] jsub [22:40:04] why do you need to de-dup the file? Can we get it de-duped upstream? [22:40:25] set -T /tmp [22:40:27] i am doing it in the grid, as i said [22:40:37] /tmp is about 15GB [22:40:45] You could try /srv [22:41:42] is the sorted uniq file valuable or just an intermediate step in the work you are doing? [22:41:58] * bd808 is thinking of the NFS server kitties [22:41:58] Why do you need to do sort? [22:43:50] i read that the uniq is for deleting repeated rows [22:44:04] but you should pipe it from a sort [22:44:18] maybe that's not necessary, or it's additionally useful? [22:45:03] marmick: I guess I'm asking "what work are you doing? and is it possible to do with less disk I/O?" [22:45:20] bd808: it's intermediate. from the 40GB i create another dataset, which is the one i apply statistical methods to. i thought of deleting repeated rows in the later processing [22:46:00] but then i should keep track of the rows (or at least some trace of several rows, which is for instance username) [22:46:18] in a dictionary. this would imply keeping in a dictionary 185 million names or sth like that [22:46:31] which i think might kill the process because it runs out of RAM, could it be?
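On the dictionary question just above: a one-pass de-duplication has to remember something for every distinct row, so its memory use grows with the number of distinct rows. One way to shrink it, sketched below in Python as an assumption and not something anyone in the channel proposed, is to remember a fixed-size digest of each row instead of the row itself. `unique_lines` and `digest_size` are hypothetical names, and whether 185 million digests fit in RAM on a given grid node is not something this sketch establishes.

```python
import hashlib


def unique_lines(lines, digest_size=16):
    """Yield each distinct line once, in first-seen order, without sorting.

    Only a `digest_size`-byte hash of every distinct line is kept in memory.
    Two different lines with the same digest would be treated as duplicates;
    with 16-byte digests that is extremely unlikely, but it is not impossible.
    """
    seen = set()
    for line in lines:
        digest = hashlib.blake2b(
            line.encode("utf-8"), digest_size=digest_size
        ).digest()
        if digest not in seen:
            seen.add(digest)
            yield line
```

Unlike `sort | uniq`, this keeps the original row order and needs no temporary disk space, at the cost of holding one set entry per distinct row.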
[22:46:53] bd808: all this is for my volunteering PhD on wikipedia editor engagement [22:47:44] where do you get the original data set from (the one that is unordered and includes duplicate data)? [22:48:16] (PS1) Matanya: add readme [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/273125 [22:48:34] (CR) Matanya: [C: 2 V: 2] add readme [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/273125 (owner: Matanya) [22:49:12] bd808: mysql, from where i create .tsv, which i process [22:50:12] does the sql select take a long time to run? Depending on the tables the data is coming from it may be much more efficient to let mysql sort and de-duplicate the data [22:50:39] sql is for that sort of thing if there are indexes in the right places [22:51:08] yes, they take a long time to run. sometimes it dies before giving any result [22:53:48] 40GB is obviously a lot of data to pull out of the mysql cluster. Can you partition the search more granularly? Like if you are looking at cross-wiki things can you get it out one wiki at a time or if you are looking at a date range can you break it down into smaller discrete chunks? [22:54:54] bd808: i'll try this. then i can delete repeated lines in a later processing step [22:56:17] marmick: if you'd like help figuring out the data pipeline (which is probably the boring part of your work) I'd be willing to help you think of approaches. We may end up with some patterns we could document on wiki that would help others trying to work with large datasets too. [22:56:47] bd808: thanks for offering :) [22:57:08] We have a lot of docs about the technical bits of running software but not a lot of things about algorithms and techniques for working with the actual data [22:57:34] well, now i'm more concerned on getting done with the data.
i'm part-time working while doing the phd, so i do this at night and i'm quite running out of time with deadlines [22:57:47] this is why i've done more intensive use of the labs lately [22:57:54] and annoyed a bit other users :S [22:58:04] but i think your idea is great. i'll take notes [22:58:17] abusing shared resources will certainly get people's attention ;) [22:58:33] bd808: haha [22:58:40] but we all do it from time to time whether purposefully or accidentally [22:58:44] i wish my results were good enough to compensate ;) [22:58:48] i'll see [22:59:28] in my case it was accidental. because i have to do a lot of things now all of a sudden and i'm coming across all the problems together [22:59:41] *nod* [23:00:18] are your scripts and queries published somewhere I can see them? [23:00:29] nope, i'll do it later [23:00:49] now i'm obtaining all the data [23:01:02] after the writing i'll tidy up the scripts a bit and upload them to git [23:01:25] that's the part that is interesting to me (getting the data) and I doubt your scripts will be the worst I've ever seen [23:01:35] they are full of comments in catalan, which i don't see interesting for other coders :) [23:01:46] * bd808 has written some truly horrible code [23:02:06] my code is not terrible. i wish it was more modular sometimes [23:02:25] but it can certainly be much better [23:25:24] going to sleep. bye!
[23:25:31] thanks bd808 [23:25:41] o/ marmick [23:32:27] marmick: we can try to stand up another bastion for you to run things on. there is also, I think, tools-precise-dev.tools.eqiad.wmflabs [23:32:36] which is meant for heavier jobs running on the bastion itself iirc [23:32:42] seems pretty under used [23:32:44] that may be ideal [23:34:33] chasemp: aw he left before seeing that [23:35:00] ah his name still autocompleted but barely I guess :) [23:35:06] ok well note for the future [23:35:48] I really wanted to get him to cough up his data collection scripts so I could see if there was a nicer way to get him the data [23:35:57] same [23:36:15] brute force things that work on 2M datasets certainly don't scale to 40G most times [23:36:52] yeah he was doing like 24G in place sorts on NFS but I think it's a bit better. problem is [23:37:01] temp location for big chunks of data, we are short on it [23:37:18] depending on what is happening it's hard to judge feasibility from what I've seen [23:41:03] Labs, Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2061952 (chasemp) [23:45:22] RECOVERY - Puppet failure on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [23:45:44] Labs, wikitech.wikimedia.org: Need to reset two-factor authentication for wikitech account - https://phabricator.wikimedia.org/T127999#2061967 (Peachey88)