[01:23:59] Labs: Investigate alternatives to NovaProxy - https://phabricator.wikimedia.org/T119936#2563196 (AlexMonk-WMF) I got Traefik building, and we realised that the API is entirely optional - you could just use the etcd backend and have Horizon write there. https://traefik-test.openstack.wmflabs.org/
[01:25:50] Labs: Build and package traefik - https://phabricator.wikimedia.org/T143294#2563198 (AlexMonk-WMF)
[03:00:11] Labs, Tool-Labs: DNS resolution sometimes fails on tools-bastion-03 - https://phabricator.wikimedia.org/T143194#2563261 (Samwilson) Looking at this again this morning, the error is not occurring when the script is run on bastion-03, but only when it's run on the grid engine. Sorry, I got confused because...
[04:33:12] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Uzume was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=817646 edit summary:
[06:47:33] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[07:09:15] hi all, long time. I'm working on recitation-bot again, and I've got into a bit of a pickle before figuring out how uwsgi threading needs to work
[07:10:31] webservice thinks my service is not running when checking status or stopping but thinks it's running already when starting
[07:16:14] the job in SGE that is stuck is 9915064
[07:22:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:39:15] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0]
[10:27:37] I'm looking for a way to create term vectors from the wikipedia db text table with an R tm corpus. Should I just import dumps into my own mediawiki instance and query directly?
[10:28:25] he1kki: the text table is not available on labs; importing dumps into a MW instance on labs would be a bad idea as well, we don't have the space for that
[10:29:47] but the dumps are available in /shared/dumps, except they seem to have largely disappeared?!
[10:30:01] oh, they are in /public/dumps
[10:30:05] valhallasw`cloud: Ah yes, I meant my own mediawiki instance on my private servers
[10:35:21] I feel tempted to just take the biggest wikis locally and query the rest of them via the wikipedia public api.
[10:36:01] What is considered fair use when querying that text data as a public, non-privileged user?
[10:40:36] HaeB: if you need all pages from a wiki, the dumps are the best option
[10:40:39] eh, he1kki
[10:41:20] if you need a small subset, the api is fine -- just use reasonable rate limits and wait for longer if the server asks you to
[11:21:34] it is not so much that the text table is not on labs, but rather that the text table does not contain the text on WMF :-)
[11:23:16] as the access specifics are too complex, we need mediawiki in between to make sure we do not expose private data, hence the dumps or the api
[14:57:07] (PS1) Jcrespo: Fake passwords to mimic in labs the striker-database ones [labs/private] - https://gerrit.wikimedia.org/r/305512 (https://phabricator.wikimedia.org/T142545)
[14:58:30] (CR) Jcrespo: [C: 2 V: 2] Fake passwords to mimic in labs the striker-database ones [labs/private] - https://gerrit.wikimedia.org/r/305512 (https://phabricator.wikimedia.org/T142545) (owner: Jcrespo)
[15:00:56] anyone know much about www.toolserver.org/~para/geoip.fcgi (which is a 301 redirect to geoiplookup.wm.o, which is an unfortunately-public service that never should've been, which we're trying to kill now)
[15:01:13] I have a proposed patch up to remove the redirect:
[15:01:15] https://gerrit.wikimedia.org/r/#/c/305418/2/modules/toolserver_legacy/templates/www.toolserver.org.erb
[15:13:42] paravoid: I think that's yours? ^
[15:14:08] no
[15:15:18] I think it's https://commons.wikimedia.org/wiki/User:Para
[15:15:24] *nod*
[15:18:21] bblack: I'm a bit confused why the redirect needs to go, though. If geoiplookup.* serves a 403, this will just redirect there and everything is good?
[15:19:25] it doesn't need to go, it can happily stay there while we take down the service (the hostname, too, so it will just fail to connect eventually)
[15:19:37] I just figured it was cleaner to remove references to it where I can before killing it :)
[15:20:22] there's only 3 known users from all our code repos: that toolserver redirect, and production's CentralNotice and ULS extensions (which need code changes to get away from it)
[15:20:23] bblack: fwiw, I don't think there are any legit uses in the access log -- most of them (~15k requests over the last month or so) are from a 3rd party wiki, the rest is negligible (few tens of hits with google.com as referer etc)
[15:21:49] the code changes for the extensions would move them to relying solely on production's GeoIP cookie. and then we're just going to break the unknown number of other / 3rd party things which are hitting the JSON service.
[15:21:50] I'll drop a note at that wiki that their geoiplookup is going away
[15:22:08] (which never should've been publicly used by others, but people have discovered it and abused it anyways)
[15:23:06] the abuse isn't entirely their fault, though. I don't think we've ever really done a good job discouraging it, and we may have even encouraged it at some point.
[16:22:02] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[16:26:28] what happened to wikibugs? :(
[16:26:59] good question
[16:27:01] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[16:27:22] greg-g, ^
[16:27:44] ohai wikibugs
[16:28:10] Krenair: did you restart it?
[16:28:48] yes
[16:29:57] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[16:34:56] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[16:47:24] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[16:51:03] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[17:03:43] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ninovolador was created, changed by Ninovolador link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Ninovolador edit summary: Created page with "{{Tools Access Request |Justification=running my bot for some pages my computer is too slow to process |Completed=false |User Name=Ninovolador }}"
[18:04:41] legoktm do you know of any popular gadgets on wiki that hit tool labs end points?
[18:04:51] yeah
[18:04:52] Krinkle ^
[18:04:55] why?
[18:05:18] someone is asking
[18:05:22] (it's complicated)
[18:05:34] https://en.wikipedia.org/wiki/MediaWiki:Gadget-BugStatusUpdate.js
[18:05:51] for example
[18:06:14] I don't remember if wdsearch still does.
[18:07:36] do you need more?
[18:07:59] yuvipanda: ? ^
[18:08:08] yeah
[18:08:15] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:44:08] Labs, Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2565438 (yuvipanda)
[18:53:06] Labs, Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2565457 (yuvipanda) p:Triage>Normal
[19:03:59] legoktm: Several I can point you to, but always opt-in, and usually not on page load.
[19:04:32] yuvipanda: ^
[19:04:49] https://github.com/Krinkle/mw-gadget-rtrc/blob/v1.3.0/src/rtrc.js#L34
[19:04:58] https://meta.wikimedia.org/wiki/User:Krinkle/RTRC.js
[19:05:11] https://meta.wikimedia.org/wiki/User:Krinkle/Tools/WhatLeavesHere.js
[19:07:19] tyvm, krinkle / legoktm
[19:14:19] Labs, Labs-Infrastructure: Set up some sort of web pages at wmflabs.org or www.wmflabs.org - https://phabricator.wikimedia.org/T38885#2565498 (AlexMonk-WMF) Yuvi also did https://gerrit.wikimedia.org/r/#/c/305540/2 You can now go to https://www.wmflabs.org. and https://wmflabs.org. DNS may be spotty unt...
[19:37:18] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:46:50] Hi, I requested the creation of a new tool ~30 minutes ago and I do not see it in the tool list (https://tools.wmflabs.org/?list), nor can I log in on tools-bastion and "become" the tool. What am I missing?
[19:51:16] CristianCantoro: hmmm... it certainly shouldn't take that long. What's the new tool name? I can poke in ldap to see if it's there
[19:51:27] wscontest
[19:51:38] thanks bd808
[19:54:08] CristianCantoro: it's at least partially created. cn=tools.wscontest,ou=servicegroups,dc=wikimedia,dc=org exists in ldap
[19:54:30] I'll look a bit more and see if I can figure anything out
[19:55:47] bd808: for the record if I try to become wscontest from bastion this is what I get:
[19:55:48] ---
[19:55:48] cristiancantoro@tools-bastion-03:~$ become wscontest
[19:55:48] become: no such tool 'wscontest'
[19:55:48] cristiancantoro@tools-bastion-03:~$
[19:55:49] ---
[19:56:34] *nod* I get the same for `sudo become wscontest`
[19:57:21] Labs, Labs-Infrastructure: Set up some sort of web pages at wmflabs.org or www.wmflabs.org - https://phabricator.wikimedia.org/T38885#2565619 (AlexMonk-WMF) Open>Resolved a:AlexMonk-WMF alex@alex-laptop:~$ host wmflabs.org labs-ns1.wikimedia.org | grep addr wmflabs.org has address 208.80.155....
[19:59:00] CristianCantoro: let's open a phab bug about this to keep track of it
[19:59:44] bd808: are you going to do it or should I? :)
[19:59:44] bd808, I remember something like this a week ago
[19:59:51] If you could start it that would be great so you get notified about progress.
[20:00:04] !log tools restarted maintain-kubeusers on tools-k8s-master-01
[20:00:06] that was the cause
[20:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:00:11] should be fixed now
[20:00:18] ah ha
[20:00:22] ldap connection to seaborgium was 'stuck' for about 2 minutes
[20:00:24] at least
[20:00:30] despite the fact I've a connection timeout there for 1s
[20:00:32] >_>
[20:00:49] confirmed that `sudo become wscontest` works for me now
[20:01:06] CristianCantoro: ^ can you get in now?
[20:01:07] the connections were all in CLOSE_WAIT
[20:01:13] yep, just got in
[20:01:18] awesome
[20:01:29] thanks yuvipanda
[20:01:47] yw. there is already a ticket for it somewhere
[20:01:56] thanks yuvipanda :)
[20:02:33] Labs, Labs-Infrastructure: Audit the labs infrastructure scripts that depend on LDAP to make sure they are resilient to failover - https://phabricator.wikimedia.org/T142394#2565670 (yuvipanda) maintain-kubeusers was stuck connecting to seaborgium today for minutes, despite there being a 1s connection tim...
[20:03:19] hmmm... CLOSE_WAIT means the remote sent a FIN and the local side hasn't sent its own FIN back yet
[20:18:36] Labs, Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2565732 (yuvipanda) List of instances (that aren't in tools): ``` | e66353af-d66a-4bb0-854d-4116c8dbb9a1 | integration-slave-precise-1002 | ACTIVE | public=10.68.17.87 | | adda8daa-...
[20:18:38] yuvipanda: ldap3's docs on the context manager claim that it will unbind properly when the with scope ends, but CLOSE_WAIT sounds like that is not happening
[20:19:51] bd808 yeah...
[20:20:01] bd808 strace just showed it stuck on a select
[20:22:12] maybe you should throw a time_limit=N on your search?
[20:22:30] the current time limit is just for connecting
[20:23:17] !log mobile deleted test-labs-vagrant, apparently I created it a long time ago
[20:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mobile/SAL, Master
[20:23:23] bd808 ooh, didn't realize search had that
[20:23:33] bd808 can you put it on the ticket and I'll get to it Soon(TM)?
[20:23:35] I just looked it up :)
[20:24:07] :D
[20:26:56] Labs, Labs-Infrastructure: Audit the labs infrastructure scripts that depend on LDAP to make sure they are resilient to failover - https://phabricator.wikimedia.org/T142394#2533513 (bd808) Adding a `time_limit=N` to conn.search in get_tools_from_ldap //might// help. The socket was connected and in CLOSE_...
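The point made above, that a connection timeout bounds only the connect step and not the request that follows, can be sketched with plain sockets. This is a generic illustration under stated assumptions, not the maintain-kubeusers or ldap3 code: the helper names `start_silent_server` and `read_with_deadline` are hypothetical, and the "server" is a local stand-in that accepts a connection and then never answers.

```python
import socket
import threading


def start_silent_server():
    """Hypothetical stand-in for an unresponsive peer: accepts, never replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # ephemeral port on localhost
    srv.listen(1)
    accepted = []  # keep accepted sockets referenced so they stay open

    def accept_once():
        try:
            conn, _ = srv.accept()
            accepted.append(conn)
        except OSError:
            pass  # listening socket was closed before anything connected

    threading.Thread(target=accept_once, daemon=True).start()
    return srv, accepted


def read_with_deadline(host, port, connect_timeout, read_timeout):
    """Connect with one deadline, then wait for data with a separate one."""
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # The connect deadline has done its job; the read needs its own.
        sock.settimeout(read_timeout)
        try:
            sock.recv(1)
            return "got-data"
        except socket.timeout:
            return "read-timed-out"
    finally:
        sock.close()


srv, _accepted = start_silent_server()
result = read_with_deadline("127.0.0.1", srv.getsockname()[1], 1.0, 0.2)
srv.close()
print(result)
```

Connecting succeeds immediately here, so only the second deadline stops the client from waiting indefinitely on a peer that has gone quiet. Whether `time_limit=N` on the search gives the same protection in the real script is left open in the discussion above.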
[20:43:15] Labs, Labs-Infrastructure: DIsable precise instance creation on horizon for all projects (except tools / deployment-prep) - https://phabricator.wikimedia.org/T143359#2565852 (yuvipanda)
[20:44:41] yuvipanda, the precise images are deactivated, so how can someone create a new precise instance?
[20:45:22] Labs, Labs-Infrastructure: Ensure that precise instance creation is disabled everywhere (except tools / deployment-prep) - https://phabricator.wikimedia.org/T143359#2565867 (yuvipanda)
[20:45:26] retitled!
[20:48:30] Labs, Labs-Infrastructure: Ensure that precise instance creation is disabled everywhere (except tools / deployment-prep / integration) - https://phabricator.wikimedia.org/T143359#2565870 (yuvipanda)
[20:56:56] Labs, Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2565438 (chasemp) Deprecating precise for CI will involve sorting out {T103786} as well > chasemp: if we are nearing 0 for precise in prod, what is it CI is testing on precise? > cha...
[21:07:49] Labs, Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2565438 (AlexMonk-WMF) re deployment-prep instances: deployment-db1, deployment-db2 - these are in the middle of a migration to deployment-db03/deployment-db04, see {T138778} deploymen...
[21:20:46] !log mobile Shut down staging.eqiad.wmflabs as a first step towards deletion. ref T143349
[21:20:47] T143349: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349
[21:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mobile/SAL, Master
[21:36:27] Labs, Continuous-Integration-Infrastructure, Patch-For-Review, Wikimedia-Incident: Nodepool instance instance creation quota management - https://phabricator.wikimedia.org/T143016#2554147 (chasemp) yesterday we had some issues with CI and @thcipriani and I poked at it for a bit making some change...
[21:54:34] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ninovolador was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=817758 edit summary:
[22:32:27] Labs, Tool-Labs, Patch-For-Review: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2566258 (chasemp) Open>Resolved https://graphite-labs.wikimedia.org/render/?width=1000&height=519&_salt=1470062067.959&target=cactiStyle(tools.tools-bastion-03...
[22:33:14] Labs, Tool-Labs, Patch-For-Review: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2566262 (chasemp) Resolved>Open On second thought, reopen as we should move this to services host or grid master after a bit of stability