[00:12:03] Earwig: ping [00:12:21] If you're there, Raymie in #wp-en needs your help [00:18:36] !log tools shut down redis on tools-redis-01 [00:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:24:13] except wikibugs is dead [00:24:15] yay [00:28:25] 6Labs, 6operations: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1907554 (10yuvipanda) Actually, I guess the confusion is sorted out, so maybe this should be closed? [00:44:49] so, uh, was anyone interested in a combined list of every botanist in every wiki language? [00:45:18] i.e. anyone in Category:Botanists on every site, or any of the subcategories on every site [00:58:26] Thomas Jefferson – American gardeners [00:58:42] i guess that's kind of a botanist [01:01:41] Johnny Miller – Golf course architects [01:02:39] maybe i need to narrow my definition of botanist [01:44:05] (03PS30) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 [02:24:18] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Arifys was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=242665 edit summary: [02:40:13] Elapsed time: 2209.1107701 seconds ... i might need to work out how to reuse database connections or something [04:02:51] OH-: pong, is he still around? [04:03:06] Let me check [04:03:23] Earwig: Yes, he's still there [04:03:30] can you tell him to PM me? [04:04:03] Ok, thanks :) [11:14:47] Is there any way to limit the number of operations per user on Tools Labs? [11:14:57] Someone is currently overloading wsexport [11:17:26] @seen YuviPanda [11:17:27] zhuyifei1999_: I have never seen YuviPanda [11:17:35] @seen YuviPanda [11:17:35] zhuyifei1999_: YuviPanda is in here, right now [11:49:27] Tpt: yes/no; we can block an IP from the whole of tool labs, but not per-tool [11:57:45] valhallasw`cloud: ok [11:58:01] Tpt: which requests do you mean? [11:58:21] there seem to be multiple users requesting wsexport data [11:58:34] if you have a look at the wsexport tool connection log [11:59:05] you will see that someone is downloading the same kind of books [11:59:14] around every 10 seconds [12:00:46] The comédies et proverbes one? [12:00:52] yes [12:01:10] it's also "Le Général Dourakine" [12:01:22] and "François le bossus" [12:01:54] nope, those are different users [12:02:21] well, different IPs, anyway. [12:03:53] valhallasw`cloud: thank you! But that's strange, it's a set of books from the same author and they have kept trying to download them for a few hours [12:04:07] the IPs are all from China, though. Hrm. [12:04:23] I also don't get why they keep re-requesting the same URL over and over again [12:04:59] I don't see either, except the fact that these books are big so it's a good way to overload the servers [12:06:06] there's something else odd: sometimes the response is 4 kB, sometimes it's 131 kB [12:07:28] maybe the 4 kB is a bad error without any nice HTML and the 131 kB ones are a nice error page [12:08:21] Yeah, could be. Somehow it's a 200 response code, so I'm a bit confused. [12:13:00] can you find where the page is linked from? [12:15:02] nothing in the referer, unfortunately [12:17:37] :-( [12:18:19] is there any caching done to prevent this possible ddos attack? [12:22:31] the user-agent string also suggests it's some sort of scraper (Chrome, more than a year old) [12:29:44] 6Labs, 10Tool-Labs: Chinese scraper (?)
with multiple IP addresses overloading wsexport - https://phabricator.wikimedia.org/T122582#1908033 (10valhallasw) 3NEW [12:29:54] now let me figure out how to ban using the user-agent [12:35:38] Tpt: should be better now [12:36:03] 6Labs, 10Tool-Labs: Chinese scraper (?) with multiple IP addresses overloading wsexport - https://phabricator.wikimedia.org/T122582#1908049 (10valhallasw) Banned via https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ATools&type=revision&diff=242815&oldid=242273 [12:38:49] 6Labs, 10Tool-Labs: Apply pretty 'banned' error page to user-agent bans - https://phabricator.wikimedia.org/T122583#1908055 (10valhallasw) 3NEW [12:43:37] valhallasw`cloud: Thank you very much! [12:44:01] I've restarted the tool to kill all the running request and the tool seems to work well now [15:07:53] !log labs salt instances salt update in progress. It's slow and tedious and automated. A few hundred instances already done, the rest are going one at a time. Only instances that use the labcontrol salt master will be affected. [15:07:54] labs is not a valid project. [15:07:59] well meh [15:08:07] so it's all projects [15:08:12] how am I supposed to log that? [15:08:34] anyways here's the deal, it's ongoing, any given project might see a given minion not respond for a minute or so [15:08:36] that's all [15:08:55] these packages fix a number of performance issues [15:19:57] apergos: there's no central "labs" sal, I think. I suppose that stuff just goes into the ops sal. [15:20:13] already there [15:21:12] Cool. The heads-up here (even if labslogbot didn't understand it ;)) is much appreciated [15:22:40] :-) [15:22:43] we're only in the b's [15:23:03] each update is about 5 sshes [15:23:20] so it will take the time it takes but I don't have to care about it, it just goes on about its business :-) [15:23:58] after this is all done I'll do beta [15:31:32] how easy is get a domain for a tool? [15:31:44] example .wikimedia.org [15:33:15] The_Photographer: nearly impossible [15:33:30] petan: great [15:33:42] why would you want a domain like that? [15:33:51] you can get a domain with .wmflabs.org suffix, that is doable [15:33:58] or you can just have tools.wmflabs.org/yourtool [15:34:15] because is more easy for the user [15:34:36] it's a security issue if you own subdomain of any production site [15:34:53] so I don't think that anyone would give you that [15:38:01] The_Photographer: Also, there is an issue with proxying and SSL certificates in a case like that [15:38:55] I underestand [15:42:15] AlexZ: Are you active in Wikimedia Commons? [16:04:34] in the d's now [16:40:22] I need a project member of Nova_Resource:Phabricator, someone here? [16:43:30] where I can get wikimedia commons css? [16:47:35] Luke081515: I'm not a member of the project but I have some wikitech super powers. What do you need help with? [16:48:23] The_Photographer: Is this what you are looking for? -- https://commons.wikimedia.org/wiki/MediaWiki:Common.css [16:48:51] bd808: I need the file at webroot/rsrc/image/sprite-menu.png at phab03, to change it at my phab instance, so that I don't have to use the wikimedia logo [16:48:55] The_Photographer: there is also -- https://commons.wikimedia.org/wiki/MediaWiki:Vector.css [16:49:17] It's the phabricator logo, to I don't have the problem with using wikimedia logos at labs instances [16:49:29] thanks bd808 [16:50:57] Luke081515: you can probably get it from the normal phabricator source? 
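A rough Python sketch of that suggestion: pull the stock menu sprite from the upstream Phabricator repository and drop it over the local copy. The raw.githubusercontent.com URL and the relative webroot path are assumptions based on the file path discussed in this exchange, and (as comes up a bit later in the log) the sprite is also baked into core.pkg.css, so a rebuild or cache purge is still needed afterwards.

    from urllib.request import urlretrieve

    # stock Phabricator menu sprite from upstream (URL assumed from the repo path discussed here)
    URL = ("https://raw.githubusercontent.com/phacility/phabricator/"
           "master/webroot/rsrc/image/sprite-menu.png")

    # overwrite the local copy; assumes this is run from the Phabricator install root
    urlretrieve(URL, "webroot/rsrc/image/sprite-menu.png")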
[16:51:35] hm, I can take a look [16:51:36] Luke081515: https://github.com/phacility/phabricator/blob/master/webroot/rsrc/image/sprite-menu.png [16:51:47] ah, ok, thanks [16:51:59] thanks valhallasw`cloud I was on my way to find that link :) [16:57:02] I need do a streess test to my tool, somebody could help me? [17:10:21] bd808: however, where is defined the class css description ? [17:12:07] The_Photographer: I'm not understanding your question. Can you rephrase or elaborate? [17:12:24] bd808: I am sorry. [17:12:55] bd808: My question is. Where I could find the css class definition for "description" class used in wikimedia commons page [17:15:03] The_Photographer: hmm... that's a good question. [17:17:12] The_Photographer: I'm looking at https://commons.wikimedia.org/wiki/File:Small_bird_perching_on_a_branch.jpg using firebug to search the css. I'm not finding any explicit css rules for the "description" class [17:17:53] The_Photographer: view the relevant html tag in your browsers devtools? [17:18:02] i.e what bd808 did [17:18:25] bd808: look the DOM [17:19:00] The_Photographer: just because the class is used in html doesn't mean it's actually defined in css [17:19:09] [17:19:19] I see td, div and span tags with the class applied, but I don't see any specific styling [17:20:45] valhallasw`cloud>: I changed the image, and restarted apache, but the old logo is still displayed. Do you think I need to restart the instance, or did I made something wrong? [17:21:06] Luke081515: there might be a build step involved [17:21:28] because it's actually embedded as base64 in the html, it seems [17:21:43] so you probably need to build core.pkg.css [17:22:01] in any case, a forced refresh (ctrl-shift-R) might heml [17:22:02] help [17:23:14] bd808: I need this css [17:24:02] The_Photographer: there is no css [17:24:28] more specifically, that class is not defined in css [17:24:38] its defined in js? [17:24:50] no, it's not defined at all [17:25:03] let's take a step back. what are you trying to do? [17:26:01] Luke081515: there is a comment on a related upstream task about needing to purge a cache -- https://secure.phabricator.com/T4214#99869 [17:26:26] ah, thanks, bd808 :) [17:27:28] hm, displays still the old logo :-/ [17:29:59] Luke081515: reading more on that upstream feature request makes me think that twentyafterfour has monkey patched our css somewhere to show the Foundation logo [17:30:10] valhallasw`cloud: I need the css definition of "description" class [17:30:28] valhallasw`cloud: to use this css in another page in my tool webservice [17:30:55] The_Photographer: no, that's 'how', not 'what' [17:31:11] bd808: I wonder, because the task at WMF phabricator was closed, by the proposed solution to change this file: https://phabricator.wikimedia.org/T117235#1829893 [17:38:13] 6Labs, 10Tool-Labs, 5Patch-For-Review: Limit webservice manifest restarts - https://phabricator.wikimedia.org/T107878#1908335 (10Ricordisamoa) If the webservice fails 3 times with the same error it should not be restarted again. [17:38:46] 6Labs, 10Tool-Labs: Limit webservice manifest restarts - https://phabricator.wikimedia.org/T107878#1908337 (10Ricordisamoa) [17:41:10] Luke081515: I'm sure there is some little trick we are missing. You may need to track down twentyafterfour (or maybe chasemp) to get some hints on how the various css assets are built [17:41:31] ok, I will ask him [17:45:08] Luke081515: maybe this? 
-- https://secure.phabricator.com/book/phabcontrib/article/adding_new_css_and_js/#changing-an-existing-fil [17:45:31] I can try it [17:46:02] bd808: Great :D solved the issue [17:46:11] \o/ [17:48:08] Coren, before you leave, can you tell me where I can find the content history of a page in the database? [17:48:27] I will add a note at the task if someone else have this problem later [17:48:31] Luke081515: thanks! [17:48:43] Cyberpower678: why do you need Coren specifically for that? [17:48:54] Because I know him the best. [17:49:28] I' very foggy on who else does Labs maintanaince. [17:49:28] As far as the replicas are concerned, the answer is 'you can't', as the content is not available in the replicas; you'll have to use the dumps or the api for that [17:49:48] dumps are out, and so is the api [17:50:29] The API is too slow and clumsy. [17:51:08] Dumps aren't updated often enough. [17:51:37] valhallasw`cloud, is there perhaps a way to ask the api to determine at what timestamp text was added to a page? [17:51:55] not that I know of [17:51:58] Dammin [17:52:12] I'm not sure why the api would be slow though -- it's unlikely direct database access would have been faster [17:52:26] No, but I can get what I need faster. [17:52:53] Without having to download the entire page history to get it. [17:53:53] I would have simply asked the database to populate the revisions of a page that had the text in it, and give me the smallest timestamp. [17:54:08] and how would you ask the database? [17:54:31] valhallasw`cloud, something like SELECT * FROM table WHERE CONTAINS(Column, 'test'); [17:54:40] that's a very good way to get an angry DBA [17:54:51] DBA? [17:54:56] database admin [17:55:03] Why? [17:55:58] because it's the same amount of work, but now you're doing it on the database server which has limited resources [17:57:02] The work isn't the issue, it's the persistent having to download the data into memory over the internet, that's problem. The work for that can be done very quickly. [17:57:21] All you have to do is restrict it to a specific page. [17:59:07] I have my doubts. In any case, it's not very relevant, as the information is not available in the database to begin with. [17:59:16] A shame. [17:59:39] I was working on a very, hopefully, low impact query. [18:00:05] All I need is a timestamp. One timestamp [18:00:52] So the idea was to restrict the query to a page, and then have it run the check until it hits the first positive. It returns that timestamp. [18:01:50] so get the page history from api.php and parse it? [18:02:17] I'm already doing that. How much memory, time, and bandwidth that costs? [18:02:25] *Do you know how [18:02:37] A lot, especially for big articles. [18:02:58] I've crashed a couple of times in the process. [18:04:51] Cyberpower678: there are some tricks you could use. If you got the list of all the revisions for the page in question then you could do a binary search of those revisions to find the one you want. [18:05:07] valhallasw`cloud: look what I am trying to do. https://tools.wmflabs.org/wikiradio/index.php?channel=USA and click in "More details" [18:05:19] valhallasw`cloud: however the details css is not defined [18:05:21] bd808, what do you mean? [18:05:48] valhallasw`cloud: in commons its defined in some place [18:05:52] bd808, I'm trying to find out at what point in time specific text was added to a page. [18:07:03] Cyberpower678: start with revision N, then go back to N/2. 
If it was there already, go back to N/4, if it wasn't there yet, go to 3N/4, etc [18:07:17] Cyberpower678: https://en.wikipedia.org/wiki/Binary_search_algorithm , or see e.g. git bisect [18:07:59] The_Photographer: 'the details css'? do you mean the font used there? [18:08:20] valhallasw`cloud: yes [18:08:21] valhallasw`cloud, I should now add some details to what I'm doing. I need to sweep the history more than. Several times per page, I'm looking for different text each time. [18:09:10] The_Photographer: if you use your browser's devtools, you can see where the font-family is defined for a specific part of the page [18:09:10] in the m's now [18:09:38] The_Photographer: in this case, it's in @media screen; html, body [18:10:09] Cyberpower678: right, so then you're back to just downloading the whole batch. [18:10:31] Do you see my predicament? [18:11:04] Whole batches aren't a viable solution. Not when you're trying to do 5 million in a reasonable amount of time [18:11:37] you want to do a full text search of all revisions of all pages in "a reasonable amount of time"? [18:11:40] that suggests your problem might just not be solvable in a reasonable amount of time [18:11:59] at least, not without preparing a database structure specifically for such a search [18:12:04] bd808, yes, yes I do [18:12:24] valhallasw`cloud, I am. But the information needs to be compiled first. [18:12:26] valhallasw`cloud: yes, it's loaded dynamically from a php, however, I need it static [18:13:04] The_Photographer: https://commons.wikimedia.org/wiki/File:Small_bird_perching_on_a_branch.jpg?debug=true [18:13:18] should load the css per file, iirc [18:13:43] so https://commons.wikimedia.org/w/load.php?debug=true&lang=en&modules=mediawiki.skinning.interface&only=styles&skin=vector [18:13:49] I suppose I could probably do binary searches for all the texts I'm searching for in one go. That would make for a complicated search algorithm. [18:14:23] sorry, https://commons.wikimedia.org/w/load.php?debug=true&lang=en&modules=skins.vector.styles&only=styles&skin=vector [18:14:24] valhallasw`cloud, another question then. How do I determine N/2 for example [18:15:11] valhallasw`cloud: I could use this link directly? [18:15:16] The_Photographer: no [18:15:47] Cyberpower678: so what you do is that you use next_test = (last_before + last_after)/2 [18:16:13] I meant revision wise. How do I tell the API that? [18:16:23] doing tools* at the same time cause bored [18:16:26] Can you give me an example? [18:17:01] Cyberpower678: you can get a list of revisions in advance, or you can use a heuristic, e.g. 'the first revision with a timestamp bigger than (time_before + time_after)/2' [18:18:07] Cyberpower678: https://bpaste.net/show/32271c8bcbc9 is what I use to bisect the SGE accounting log [18:18:54] where the heuristic is 'the first full line after byte (byte_before + byte_after)/2' [18:24:10] valhallasw`cloud, out of curiosity, what column would hold the content data? [18:24:28] ? [18:24:34] Back to the DB [18:24:41] in revision [18:25:27] Cyberpower678: the text table, *if* the text is stored in the database (which iirc is not the case for WMF wikis) [18:25:44] see https://www.mediawiki.org/wiki/Manual:Revision_table [18:27:47] Cool
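A minimal sketch of the bisection approach described above, assuming the standard MediaWiki action API: fetch the page's revision ids once, then binary-search them for the first revision containing a given string, so only about log2(N) revision texts are downloaded instead of the whole history. 'page' and 'needle' are placeholder inputs, and the search is only well defined if the text, once added, was never removed.

    import requests

    API = "https://en.wikipedia.org/w/api.php"   # any MediaWiki action API endpoint

    def revision_ids(page):
        """All revision ids of a page, oldest first (follows API continuation)."""
        ids, cont = [], {}
        while True:
            data = requests.get(API, params={
                "action": "query", "prop": "revisions", "titles": page,
                "rvprop": "ids", "rvdir": "newer", "rvlimit": "max",
                "format": "json", **cont}).json()
            pageinfo = next(iter(data["query"]["pages"].values()))
            ids.extend(rev["revid"] for rev in pageinfo.get("revisions", []))
            if "continue" not in data:
                return ids
            cont = data["continue"]

    def revision_contains(revid, needle):
        """True if the wikitext of the given revision contains the needle."""
        data = requests.get(API, params={
            "action": "query", "prop": "revisions", "revids": revid,
            "rvprop": "content", "format": "json"}).json()
        pageinfo = next(iter(data["query"]["pages"].values()))
        return needle in pageinfo["revisions"][0]["*"]

    def first_revision_with(page, needle):
        """Earliest revision containing needle, found by bisection."""
        revs = revision_ids(page)
        lo, hi = 0, len(revs) - 1        # assumes needle is present in the current revision
        while lo < hi:
            mid = (lo + hi) // 2
            if revision_contains(revs[mid], needle):
                hi = mid                 # present here: first occurrence is at or before mid
            else:
                lo = mid + 1             # absent here: first occurrence is after mid
        return revs[lo]

For the several-strings-per-page case mentioned above, the revision id list only needs to be fetched once and can be reused for each bisection.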
[19:00:07] p's now (plus still doing tools) [19:01:40] what is a popcorn editor? [19:01:56] I ask because I am watching all these curious instance names go by on my screen [19:02:50] mozilla's popcorn editor [19:02:52] apparently [19:03:31] which doesn't exist anymore? #confused [19:03:40] video editor, I think [19:07:08] yeah, it's a video editor that Brian is/was messing about with [19:11:50] oh, I almost completely forgot about that [19:12:24] maybe we should mail reminder emails once a month like mailman does about lists [19:12:42] you are the admin of these projects with these instances. to delete them, click here :-D [19:13:02] probably just get filtered into people's wmf spam box [19:14:31] valhallasw`cloud: I manually added the css [19:21:41] 6Labs, 7LDAP: Restore ldaplist -l passwd - https://phabricator.wikimedia.org/T122595#1908530 (10Andrew) 3NEW a:3MoritzMuehlenhoff [19:23:55] Error writing to output file - write (28: No space left on device) [IP: 208.80.154.10 80] [19:23:56] [19:23:59] another fine instance [19:24:12] I do believe cleanup after this will call for a bit of alcohol [19:36:06] andrewbogott: http://tools.wmflabs.org/contact also allows searching by uidNumber, but it's not as conveniently grep/sed-able [19:36:50] valhallasw`cloud: I also suspect that that changed limit is breaking the LdapAuthentication extension [19:39:34] Mmmm. Yeah, that could be. It's not entirely clear to me what is limited by ldap (and in some sense, I find it weird a 'give all results' is considered more costly than 'filter results by this *wildcard* query') [19:41:59] valhallasw`cloud: Because the server doesn't try to parse the query to decide whether it needs to filter at all. [19:47:07] 6Labs, 7LDAP: Restore ldaplist -l passwd - https://phabricator.wikimedia.org/T122595#1908604 (10Andrew) [20:31:55] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:43] again [20:33:57] hi [20:34:11] the tools home page seems fine to me [20:34:45] slow to me [20:34:47] that main page has been slow for a long time :( needs redoing to be not so big [20:36:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 971987 bytes in 5.168 second response time [20:41:53] gah what happened on dec 7 so that instances that had accepted salt keys with the proper hostname (name.project.eqiad.wmflabs) also got a bad salt key added (name.eqiad.wmflabs)? [20:42:38] it's a nice mix of instances so not just one project either [20:44:24] hi Labs experts: http://mwui.wmflabs.org/w/index.php is returning a 500. I don't have access, maybe someone around could help here further?! [20:45:04] Volker_E: do you know what project it is in? [20:45:40] YuviPanda: Define 'project'? (I'm pretty new to labs) [20:45:55] Volker_E: const project; [20:46:11] :) [20:46:41] YuviPanda: let me try it again: I have _no clue_ [20:47:03] ah. do you know who was maintaining it before? [20:49:16] YuviPanda: prob werdna or prtksxna [20:49:51] Volker_E: ok, in that case I think one of them has to grant you access [20:50:51] Are there wmflabs dev, quality and production environments? [20:51:33] or is it simply a shared crazy environment? [20:52:35] I don't understand your first question, The_Photographer but yes it's a bit crazy [20:54:15] YuviPanda: https://en.wikipedia.org/wiki/Development,_testing,_acceptance_and_production [20:55:17] ah, quality as in QA [20:55:31] yeah, no. some people create a 'toolname-dev' tool to act as testing [20:57:01] however, everything is on the same server?
[20:57:16] nope, we've about a 80 nodes [20:57:30] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help has more info [20:57:55] same enviroment, however X numbers of virtual servers [20:58:57] yeah [21:01:43] YuviPanda: werdna is not here any more, hope prtksxna can help further [21:01:44] YuviPanda: thanks! [21:02:54] yw [21:15:21] 6Labs: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1908856 (10yuvipanda) 3NEW [21:15:25] 6Labs, 6operations: Increase timeout for tools-home check - https://phabricator.wikimedia.org/T122615#1908866 (10yuvipanda) [21:22:13] YuviPanda: is there a way to look at the web proxy data and tell what project/instance has a given proxy configured? [21:22:30] (if not would it be bad to build something that exposed that?) [21:23:08] bd808: you can look at redis directly but yeah not elsewhere [21:23:37] would be nice to have a way to do that there's a bug for it [21:24:35] https://phabricator.wikimedia.org/T115752 [21:25:41] bd808: yeah. next quarter's goal involves providing proxy setup on Horizon, I want to do a full redo and cleanup [21:26:02] bd808: and possibly switch to vulcand [21:26:08] rather than our homemade redis+lua seutp [21:28:52] is that redis instance visible from labs? [21:29:05] bd808: nope, it's firewalled off [21:29:09] bah [21:29:21] since otherwise anyone could read/write routes to say whatever :) [21:29:26] yea [21:29:35] bd808: the proxies themselves are labs instances [21:29:46] bd808: I can grant you access to that project if you wish [21:30:29] I... probably shouldn't fall in that rabbit hole today [21:30:30] :) [21:30:30] wise choice [21:46:13] hi! is there any known issue with SGE right now? [21:46:30] i’m getting this error when i try to use qstat: [21:46:31] error: commlib error: access denied (client IP resolved to host name "tools-bastion-01.tools.eqiad.wmflabs". This is not identical to clients host name "tools-bastion-01.eqiad.wmflabs") [21:46:31] error: unable to contact qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs" [21:46:57] Toto_Azero: Not sure, that looks odd. [21:47:20] and my usual running scripts look down for like 1/2 hours [21:47:33] * Coren checks. [21:49:27] Something seems to have changed subtly with DNS reverse mapping; I'm looking at things now. [21:50:23] ok [22:04:14] anyone feel like kicking labs-dnsrecursor2.openstack.eqiad.wmflabs around a bit? right now pdns-recursor fails cause it can't even look up the ldap servers [22:04:33] since it can't look up anything at all it can't of course apt-get update, nor run salt or anything else [22:05:02] YuviPanda: you around? andrewbogott^ [22:05:14] yep, here [22:05:20] apergos: still? [22:05:23] soemthings up with nfs or at least there is a page [22:05:23] yeah [22:05:33] is there general dns weirdness that relates then? [22:05:37] chasemp: join us in -operations [22:05:37] I'm going through the list of 'no salt'... down to I think just 3 more instances [22:06:27] others were working just fine [22:06:28] apergos: I can just delete that instance unless you care about it [22:06:41] I do not at all [22:06:47] I don't know if someone does though [22:06:51] * apergos gets off it [22:07:14] if you do delete it, let me know so I can update my todo list. 
and thanks [22:11:03] I'm done I think with my pass over instances that don't have their own salt master [22:27:53] 6Labs, 10Salt: lab instances with broken salt which need to be fixed by instance owners - https://phabricator.wikimedia.org/T112512#1909025 (10ArielGlenn) 5Open>3Resolved a:3ArielGlenn this is now obsolete, we have a shorter list from a new salt upgrade so I can close this (most entries here are fine or... [22:28:39] 6Labs, 10Salt: clean up old ec2id-based salt keys on labs - https://phabricator.wikimedia.org/T103089#1909037 (10ArielGlenn) 5Open>3Resolved a:3ArielGlenn No more old i-000xxx names in my list of salt keys, it's very exciting! Closing. [22:34:23] considering that last time tiles.wmflabs.org broke we had moans of "omg you're breaking my site not related to Wikimedia", the question is: do we have it written somewhere that labs are for uses related to WMF projects only? [22:37:54] MaxSem: it's in the labs TOU, but that technically only applies to people who build stuff on labs, not to users: https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use [22:38:21] (front page says 'Labs Account Holders are governed by the Labs terms of use.', but that page suggests it applies to both) [22:42:07] so I have a few remaining instances which basically are the one dns recursor and a bunch that are unreachable by ssh... [22:42:39] anyone care to look at them? it's these https://phabricator.wikimedia.org/T115287#1909017 except for the icinga instance which is ok [22:42:58] YuviPanda: or andrewbogott if you are not buried in ldap or nfs still [22:50:41] 10MediaWiki-extensions-OpenStackManager, 10CirrusSearch, 6Discovery: Searching for "Hiera:" with namespace "Hiera" deselected still shows results in "Hiera:" - https://phabricator.wikimedia.org/T110377#1909127 (10Deskana) p:5Triage>3Lowest [22:51:11] apergos: I don’t think I’m going to have a chance tonight, sorry [22:51:32] no worries [22:51:39] they can just bide their time [22:58:53] wikitech is slow at the moment, it takes a lot of time to reload the output for rcm-3 [22:59:43] Luke081515: probably fallout from all the ldap craziness going on [23:05:42] is the grid unreachable? [23:06:56] Coren: error: commlib error: access denied (server host resolves rdata host "tools-submit.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") [23:06:59] error: unable to contact qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs" [23:07:02] error: commlib error: access denied (server host resolves rdata host "tools-submit.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") [23:07:05] Unable to run job: unable to contact qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs". [23:07:08] Exiting. [23:07:11] error: commlib error: access denied (client IP resolved to host name "tools-submit.tools.eqiad.wmflabs". This is not identical to client [23:07:14] s host name "tools-submit.eqiad.wmflabs") [23:07:16] error: unable to contact qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs" [23:08:40] hi doctaxon. we're looking into it (see /topic), probably LDAP related [23:08:41] ldap again [23:08:41] what are you doing so much with ldap? [23:08:42] patches to move off it to alternatives welcome [23:08:42] I gotta bow out of this channel cause if my network drops or freenode drops me I won't autojoin overnight [23:08:42] so...
[23:08:45] see you in the other channels [23:22:45] !log tools restart gridengine-master on tools-grid-master [23:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [23:24:44] YuviPanda: are tehse teh same vm's that are not puppetized? [23:25:11] chasemp: they're puppetized now, mostly. [23:25:25] :) [23:26:22] error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error). [23:26:25] useful error message! [23:26:41] yeah, my errors spit the same [23:27:04] Coren: ^ around and any idea what that could mean? [23:27:11] qmaster is fineish, stracing shows it doing things [23:27:22] but also: error: commlib error: got read error (closing "tools-grid-master.tools.eqiad.wmflabs/qmaster/1") [23:31:45] !log tools rebooting tools-grid-master [23:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [23:36:08] YuviPanda: any luck? [23:37:03] nope [23:37:37] I saw most connections to master were from tools-submit (our cron setup) so I restarted that but still nothing's goin on here [23:37:58] and it isn't firewalls or anything [23:38:42] YuviPanda: I have to go to bed now, but try stracing qstat on tools-bastion [23:38:47] it does communicate with the master [23:38:48] but [23:39:08] sometimes read(3, 0x7fdb4045c000, 22) = -1 EAGAIN (Resource temporarily unavailable) [23:39:13] sometimes poll([{fd=3, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout) [23:39:30] yeah, it's pollin [23:39:30] actually, it seems to get stuck in that poll loop [23:39:36] but it has communicated before that! [23:39:46] with a response from the master,even [23:40:20] am running a fuller strace now [23:41:14] valhallasw`cloud: haha, it doesn't even output error now that I"m watching with strace [23:41:17] it's just hung forever [23:41:53] aha it does now [23:42:01] * YuviPanda waits for it to fully fail before looking at strace [23:42:56] YuviPanda: /data/project/.system/gridengine/spool/qmaster/messages_shadowd.tools-grid-shadow suggests NFS issues in any case ('got timeout error while read data from heartbeat file "heartbeat"') [23:43:24] but that's 25 mins ago [23:43:32] is there anything to restart if in theory nfs comes back? [23:44:06] NFS is back [23:44:08] YuviPanda: and there's almost 12400 files in /data/project/.system/gridengine/spool/spooldb, which I'm pretty sure is not right [23:44:09] and I restarted the masters [23:44:57] but that might just be another symptom ;/ [23:45:06] yeah... [23:45:44] and the spooldbs are also binary [23:45:46] great [23:46:12] YuviPanda: berkeley db [23:46:18] there's also not much interesting in there [23:46:51] YuviPanda: what's the play man? how can I help? I know nothing about this setup here [23:47:01] can we just rest all teh things and risk losing transient jobs etc [23:47:12] assuming nfs and ldap outages caused gridengine to go off the map [23:47:25] yeah, but a master restart should solve that [23:47:31] I already did that yeah [23:47:36] the heartbeat file is over 10min old [23:48:03] and it's supposed to touch it every 30s [23:48:27] chasemp: not sure, we're all kindof blindly-ish groping around-ish [23:48:43] 12/29/2015 22:06:25|listen|tools-grid-master|E|commlib error: local host name error (IP based host name resolving "tools-webgrid-lighttpd-1204.tools.eqiad.wmflabs" doesn't match client host name from connect message "tools-webgrid-lighttpd-1204.eqiad.wmflabs") [23:48:52] but that's also many hours ago [23:50:14] does that work now? 
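The commlib errors pasted here all come down to forward and reverse DNS disagreeing about a node's name (tools-submit.tools.eqiad.wmflabs vs tools-submit.eqiad.wmflabs). A small stdlib-only sketch one could run on an affected host to spot that kind of mismatch; the host_aliases path is the one quoted above.

    import socket

    announced = socket.getfqdn()                    # name the host announces, e.g. tools-submit.eqiad.wmflabs
    ip = socket.gethostbyname(announced)            # forward lookup
    reverse, aliases, _ = socket.gethostbyaddr(ip)  # reverse lookup

    print("announced:", announced)
    print("reverse:  ", reverse, aliases)
    if reverse != announced and announced not in aliases:
        print("mismatch -- the same inconsistency gridengine's commlib is rejecting")
        print("check /var/lib/gridengine/default/common/host_aliases (served from NFS) and the labs recursor")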
[23:50:30] talk me through what I can look at, I don't know where the heartbeat file is or what touches it etc [23:50:38] for tha tmatter I don't know what the current state of things affects :) [23:50:45] chasemp: ok, so this is on /data/project/.system/gridengine on NFS on tools [23:50:57] chasemp: current state is that jobs that are running continue to run [23:51:01] just all interactions with it are broken [23:51:23] we can't use qmod etc? [23:51:25] yeah [23:51:29] qstat just hangs forever [23:51:41] becuase the master state is bad? [23:51:50] I assume the master is the focus of all coordination? [23:51:52] presumably. [23:51:54] yes [23:51:54] what are the master vms? [23:51:58] tools-grid-master [23:52:07] there's an strace of a failed qstat at /data/project/wikibugs/fuck [23:52:11] secondary? [23:52:18] tools-grid-shadow [23:52:59] YuviPanda: there is decent odds something here doesn't work right on boot eh? [23:53:07] YuviPanda: there's two masters running on -shadow? I suppose that's you testing? [23:53:10] have you gone through to see the master doing all teh amster things? [23:53:26] valhallasw`cloud: yeah but there are no masters [23:53:38] chasemp: I strace'd it and its eemed ok but that was before the restart [23:53:50] chasemp: 'master things' is like responding to qstat which it isn't doing [23:53:53] sorry, on -master [23:53:54] right [23:53:57] valhallasw`cloud: oh, no, that's weird [23:54:10] valhallasw`cloud: I see only one? [23:54:27] yeah, it's gone now. It was running under root, so I thought it must have been you? [23:54:41] strange, no wasnt' me [23:54:47] I wonder if that's puppet [23:54:54] * YuviPanda runs puppe [23:57:33] valhallasw`cloud: interesting. puppet tries to hit q* somehow [23:57:42] and hence fails ofc [23:58:03] YuviPanda: [23:58:08] so is it interesting there is a huge alias file [23:58:09] for dns [23:58:13] cat /var/lib/gridengine/default/common/host_aliases [23:58:18] guess it's not just me having problems [23:58:33] so, fwiw, the shadow master is doing very little. No fds open, and talks to master every now and then, so doesn't take over [23:58:39] that would in theory expose the above if nfs is down [23:58:51] i.e. it is doing some file based dns things from nfs [23:59:00] so nfs flakes and suddently it thinks nodes are the wrong nodes? [23:59:06] that sounds totally nuts but [23:59:10] chasemp: kinda, but we restarted master and NFS is back now [23:59:27] I'm just walking through the only failure we've actually pasted :) [23:59:30] YuviPanda: so I have a crazy suggestion. Stop master altogether and see if shado picks up? [23:59:40] valhallasw`cloud: ok [23:59:45] but there's something that just stopping doesn't make the shadow take over iirc [23:59:46] argh. [23:59:48] chasemp: yeah, https://phabricator.wikimedia.org/T109485 and related. it's crazy-ish [23:59:52] valhallasw`cloud: it theoretically should [23:59:55] ok [23:59:58] valhallasw`cloud: once the heartbeat file is 10m old
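A tiny diagnostic sketch of the check being discussed here: how old the gridengine heartbeat file is. The path is assembled from the spool directory and filename mentioned in the log, and the 30-second / 10-minute figures are the ones quoted above; this is only a monitoring aid, not the shadowd takeover logic itself.

    import os
    import time

    HEARTBEAT = "/data/project/.system/gridengine/spool/qmaster/heartbeat"
    TOUCH_INTERVAL = 30      # master is supposed to touch the file every ~30 s
    TAKEOVER_AGE = 600       # shadow is expected to step in once it is ~10 min old

    age = time.time() - os.path.getmtime(HEARTBEAT)
    if age >= TAKEOVER_AGE:
        print("heartbeat is %.0fs old -- shadow master should be taking over" % age)
    elif age > 2 * TOUCH_INTERVAL:
        print("heartbeat is %.0fs old -- master looks wedged (NFS hiccup?)" % age)
    else:
        print("heartbeat is %.0fs old -- looks healthy" % age)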