[00:09:54] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [00:24:50] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [00:35:51] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:53:02] which node or instance do i have to run puppet (compiler) on to show that changes to the gridengine module dont break it [00:53:43] tools-exec- ? [00:54:40] ah..manifests/role/labstools.pp is the gridengine master it looks [00:55:23] tools-webgrid- above actually sounds like it .. [01:26:27] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [01:27:33] mutante: does puppet compiler run on labs now? [01:28:31] YuviPanda: i..eh.. i dont really know actually [01:28:42] i just use it for things in site.pp all the time [01:28:50] mutante: yeah so I guess it doesn't since it didn't the last time I checked :) [01:29:02] so i wasnt sure how to handle the same thing when touching the gridengine module [01:29:59] in general with most things gridengine related it is 'hope and pray' [01:30:00] anyways, that "puppet failure" above , that also happens to be on webgrid.. unrelated.. i merged nothing so far [01:30:19] so that seemed not like a good moment maybe, heh [01:30:22] yeah that's unrelated and ongoing for ages [01:30:25] ok [01:30:28] mostly about things running out of memory [01:30:35] I've pooled in 3 new nodes hopefully that'll help [01:31:19] alright, cool [02:01:27] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [02:11:51] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [02:12:43] YuviPanda: do you know why diamond is so upset about 1408? 
[02:13:43] andrewbogott: diamond? [02:13:51] andrewbogott: I thought those were usually actual puppet failures [02:13:55] caused partially by memory failures [02:14:03] I’m getting one of these every 5 minutes: tools-webgrid-lighttpd-1408 : Nov 3 01:49:10 : diamond : unable to resolve host tools-webgrid-lighttpd-1408 [02:14:08] not related to puppet as far as I know [02:14:10] oh [02:14:12] hmm [02:14:16] no I've no idea [02:14:21] although that usually means it's a DNS issue [02:14:31] and since diamond does a lot of DNS resolution it's the canary in the coal mine [02:15:02] where does it run? [02:15:22] andrewbogott: I can't even ssh to tools-webgrid-lighttpd-1408 [02:15:28] where does what run? [02:15:43] where does whatever test is complaining about dns resolution run? [02:15:57] no it's not an actual test [02:16:01] diamond just does DNS resolution [02:16:07] so the cron failure is just actual DNS failure [02:16:15] as to why it actually emails people I've no idea [02:16:18] right, but failure where? [02:16:23] on tools-webgrid-lighttpd-1408? [02:16:24] Because I can resolve tools-webgrid-lighttpd-1408 just fine [02:16:33] no it's trying to resolve its own hostname on that host usually [02:16:40] oh! I see. [02:16:41] ok. [02:16:48] andrewbogott: but that instance is dead :| [02:16:52] which is actually more concerning [02:16:55] i can reach it [02:16:58] I can't ssh [02:17:03] although I’m kicked off periodically [02:17:05] no idea what that’s about [02:17:12] connection closed by UNKNOWN [02:17:18] that's not good... [02:17:28] I can’t believe how consistently broken instance migration is in nova :( [02:17:40] it’s still drained, right? [02:17:42] is this the instance we tried to move? 
[02:17:44] yeah [02:17:48] it's not repooled [02:17:52] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:17:55] I pooled in 3 new other instances [02:18:05] ok, I’ll see if I can get it to shape up [02:18:07] otherwise we can just kill it [02:18:24] andrewbogott: yeah agree [02:34:04] YuviPanda: I’m trying to migrate it again in hopes of convincing openstack to rebuild the networking. that’ll take forever though - I’ll check back. [02:36:33] andrewbogott: ok! [02:37:51] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [03:52:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [04:32:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [08:06:33] Labs, Tool-Labs: Re-enable tools-webgrid-lighttpd-1408 while there are no new nodes - https://phabricator.wikimedia.org/T117490#1776678 (yuvipanda) I've pooled 1413, 1414 and 1415 instead [08:06:39] Labs, Tool-Labs: Re-enable tools-webgrid-lighttpd-1408 while there are no new nodes - https://phabricator.wikimedia.org/T117490#1776679 (yuvipanda) Open>Resolved a:yuvipanda [08:23:25] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [09:03:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [09:28:46] mutante: for testing puppet manifests on tools-like infra, toolsbeta is the best option for now. puppet-compiler would be really cool to have, though :( [09:29:25] YuviPanda: so you're going to give us millions of *.tools.wmflabs.org certificates?
:> [09:54:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [11:04:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [13:18:44] i need help [13:18:44] Hi aj_, just ask! There is no need to ask if you can ask [13:18:56] ok [13:20:55] and [13:21:10] lol. [14:13:34] Labs, Beta-Cluster-Infrastructure, Labs-Infrastructure, Graphite, and 2 others: Delete more specific deployment-prep graphite datapoints - https://phabricator.wikimedia.org/T111540#1777190 (fgiunchedi) @krenair I've ran `archive-instances` on `labmon1001` so the deployment-prep hosts are gone, not... [14:50:55] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:46] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 920220 bytes in 2.589 second response time [15:25:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:32:00] Labs, Beta-Cluster-Infrastructure, Labs-Infrastructure, Graphite, Shinken: Delete more specific deployment-prep graphite datapoints - https://phabricator.wikimedia.org/T111540#1777412 (Krenair) [15:32:34] Labs, Beta-Cluster-Infrastructure, Labs-Infrastructure, Graphite, Shinken: Delete more specific deployment-prep graphite datapoints - https://phabricator.wikimedia.org/T111540#1606894 (Krenair) How do we detect that those exist in the first place?
[15:40:43] Labs, wikitech.wikimedia.org: DB error when trying to create an account on wikitech - https://phabricator.wikimedia.org/T117553#1777456 (MarkAHershberger) NEW [15:44:19] Labs, wikitech.wikimedia.org: DB error when trying to create an account on wikitech - https://phabricator.wikimedia.org/T117553#1777478 (Krenair) ```2015-11-03 15:37:22 silver labswiki exception ERROR: [ec2a5414] /w/index.php?title=Special:UserLogin&action=submitlogin&type=signup&returnto=OCG MWExceptio... [15:47:01] Labs, wikitech.wikimedia.org: DB error when trying to create an account on wikitech - https://phabricator.wikimedia.org/T117553#1777482 (MarkAHershberger) Note that the account was created, though. [15:51:19] Labs, wikitech.wikimedia.org: DB error when trying to create an account on wikitech - https://phabricator.wikimedia.org/T117553#1777512 (jcrespo) I am unable to locate the error on recent entries of our logs, could you indicate when that happened exactly, or try again and note down at which time the error... [15:53:12] Labs, wikitech.wikimedia.org: DB error when trying to create an account on wikitech - https://phabricator.wikimedia.org/T117553#1777536 (MarkAHershberger) I'm able to log in as "hexmode" with the user I created. This makes sense since LDAP != SUL. [15:56:19] YuviPanda: last attempt to migrate 1408 went fine, so now I’d like to try migrating the other nodes on https://etherpad.wikimedia.org/p/depool [15:56:28] let me know when you’re around and they’re fully drained? [16:00:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [16:16:17] i have an instance, and many users are sharing it. How can we use "become" to share an account like on tools-labs? What is `become`? Is it a shell script?
[16:17:13] notconfusing: vim `which become` [16:17:33] it essentially calls sudo [16:18:40] and sudo is configured to allow people in group tools.abcd to sudo as tools.abcd [16:51:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [17:28:56] valhallasw`cloud: thanks for pointing out "toolsbeta" is a thing as well [17:31:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [17:37:25] Hi, I want to do some analytics for myself on cswiki and skwiki, but I don't know how to start. I have read some manuals on http://wikitech.wikimedia.org, but they didn't help me. What can I do? Ask for shell access or something else? Thank you for the advice. [17:38:00] Labs, Phabricator, Puppet: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1778025 (mmodell) p:Triage>High [17:38:28] Urbanecm: what kind of analytics? you want access to server logs ? [17:39:43] Urbanecm: maybe this meta page is useful for you https://wikitech.wikimedia.org/wiki/Analytics [17:40:04] No, I want to analyze the content of cswiki, for example the count of links to disambiguation pages or a list of long articles from skwiki that are untranslated (to Czech). I'll read the page, thank you. [17:47:20] Urbanecm: the easiest interface for querying the database is https://quarry.wmflabs.org/ [17:51:04] I know about Quarry but it has a 20-minute limit and my query took more than 20 minutes.
I tried this query: [17:51:08] use cswiki_p; [17:51:10] SELECT categorylinks.cl_from, COUNT(pagelinks.pl_from) FROM page, pagelinks, categorylinks WHERE (categorylinks.cl_to = 'Rozcestníky') AND (categorylinks.cl_from=page.page_id) AND (page.page_title = pagelinks.pl_title) AND (pagelinks.pl_namespace = 0); [19:06:00] Labs, Labs-Infrastructure, netops, operations, and 2 others: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1778439 (chasemp) [19:22:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [19:23:45] YuviPanda: ^ few of this type of thing today, is this something that is actionable? [19:24:39] chasemp: so current theory is that it's just that node running out of memory - there's this tool called kmlexport that blows up, leaks memory and dies all the time [19:24:47] and blows up enough memory to cause puppet to fail [19:24:54] I've no idea how to fix that. [19:25:04] it is recurring since it blows up, dies, gets restarted, etc [19:28:48] chasemp: so I guess not fully actionable? [19:29:34] we think some tool has a leak and is blowing things up while it inflates and restarts [19:30:20] Labs, Tool-Labs, operations, Icinga, and 2 others: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1778548 (yuvipanda) Open>Resolved Done [19:31:07] chasemp: it's been on for a couple of months and coinciding exactly with the host kmlexport has been on [19:31:15] let me see if I can trigger moving it to a different host now [19:32:48] chasemp: ok if that theory is true the puppet failures should be on 1414 soon [19:32:58] kk [19:33:05] we wait! ;) [19:33:31] we do!
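Note on the cswiki query posted at [17:51:10]: it uses COUNT() without a GROUP BY, so it can return at most one aggregated row rather than one count per disambiguation page. Below is a minimal sketch of a grouped variant, assuming the intent was "incoming links per page in the category Rozcestníky". It runs against toy in-memory SQLite tables, not the real replica; the sample rows and the names GROUPED_QUERY and rows are made up for illustration, and only the table and column names come from the query in the log. Whether this form finishes within Quarry's time limit on the real data is not something this sketch can show.

```python
import sqlite3

# Toy in-memory tables. Only the columns used by the query in the chat are
# modelled (plus page_namespace, an assumption); the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 0, "Mercury"), (2, 0, "Java"), (3, 0, "Article_A"), (4, 0, "Article_B"),
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Rozcestníky"), (2, "Rozcestníky"),
])
conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?)", [
    (3, 0, "Mercury"), (4, 0, "Mercury"), (3, 0, "Java"), (4, 0, "Article_A"),
])

# Grouped variant: one output row per disambiguation page, with explicit joins
# and the link namespace matched against the page namespace.
GROUPED_QUERY = """
SELECT page.page_title, COUNT(pagelinks.pl_from) AS incoming_links
FROM categorylinks
JOIN page ON page.page_id = categorylinks.cl_from
JOIN pagelinks ON pagelinks.pl_title = page.page_title
              AND pagelinks.pl_namespace = page.page_namespace
WHERE categorylinks.cl_to = ?
  AND page.page_namespace = 0
GROUP BY page.page_id, page.page_title
ORDER BY incoming_links DESC, page.page_title
"""

rows = conn.execute(GROUPED_QUERY, ("Rozcestníky",)).fetchall()
for title, count in rows:
    print(title, count)
```

On the toy data this lists each disambiguation page with its own link count, which is the per-page breakdown the original single-row aggregate cannot give.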
[19:34:48] YuviPanda: yeah, the only solution I can think of is a specific kmlexport host [19:34:58] or fixing kmlexport [19:34:59] yeah [19:35:10] I don't know if the latter is an option from the 'admin' side of things [19:35:24] we can also move it to a container once we have that available and then it'll not kill others but just die itself [19:35:28] mostly what happens is that someone requests something that doesn't finish quickly, then refreshes a few times [19:35:43] and suddenly there's 3 or 4 long-running memory-eating processes [19:35:50] that's actually a great idea [19:36:13] yeah since then it'll just die itself and come back up [19:36:33] yeah prime container candidate [19:36:40] contain the harm to self harm only [19:36:45] yeah [19:44:16] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Urbanecm was created, changed by Urbanecm link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Urbanecm edit summary: Created page with "{{Tools Access Request |Justification=I want to use Labs shell access for: * running a bot which will count number of links to disambiguation pages * running another bot whi..." [20:02:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [20:43:56] chasemp: that theory might take a while to verify since 1414 is so underloaded [20:44:06] andrewbogott: I'm checking to see how full the two hosts are [20:44:11] no worries [20:44:29] YuviPanda: what's the suspect process there?
[20:44:52] chasemp: lighttpd and its perl children under user tools.kmlexport [20:45:01] andrewbogott: ok, 1402 is ready to go [20:45:15] andrewbogott: 1203 still has some jobs left, so I'd like to give them as much time as possible [20:45:26] andrewbogott: so migrate 1402 and then we can do 1203 [21:14:38] suddenly unable to connect to my labs server in the last day: "ssh: Could not resolve hostname phlogiston-1.phlogiston.eqiad.wmflabs: Name or service not known" [21:15:03] YuviPanda: ok, 1402 on its way... [21:15:07] on several computers where it used to work. Any server-side stuff I should be aware of before I start digging through local changes? [21:21:43] andrewbogott: I'm heading to the office now, ok for 1203 to wait till I arrive there? [21:21:45] might be an hour [21:21:55] YuviPanda: yep, no problem [21:22:15] andrewbogott: cool. [21:23:40] select count(*) from revision, really? [21:24:44] jynus: but but but! [21:25:07] jynus: I think you might be forgetting not everyone understands the performance implications of certain SQL queries :-p [21:25:16] oh, do not worry, it is not enwiki [21:25:18] "I want to see how many revisions..." [21:25:27] exactly. [21:25:35] it is wikidata, where there are just a few edits [21:26:06] and then you google 'mysql count number of rows' et voila! [21:26:30] well, select * in myisam is almost as fast as getting corrupted [21:26:38] so there it is [21:27:41] count(*) is a guaranteed sequential scan .. [21:29:15] darkblue_b, sorry, I forgot the [21:29:24] :-) [21:47:21] Labs, Labs-Infrastructure, operations, labs-sprint-117, labs-sprint-118: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1779404 (RobH) If these exist as bare metal to the OS (that has the userspace the labs user is in) then they have direct hardware access... [21:47:44] Labs, Labs-Infrastructure, operations, labs-sprint-117, labs-sprint-118: How to handle mgmt lan for labs bare metal?
- https://phabricator.wikimedia.org/T116607#1779414 (RobH) [22:27:15] There is a bunch of errors in logstash about job runners executing jobs for labswiki on random mw* apaches and failing due to db not visible from there. [22:27:38] Labs, Labs-Infrastructure, labs-sprint-117, labs-sprint-118: Make designate able to pull the project name from a project ID - https://phabricator.wikimedia.org/T117610#1779655 (Andrew) NEW a:Andrew [22:27:44] https://logstash.wikimedia.org/#dashboard/temp/AVDPdjzEptxhN1Xakktg [22:39:04] YuviPanda: can we merge https://gerrit.wikimedia.org/r/#/c/246295/ [22:42:04] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Urbanecm was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=199405 edit summary: [22:48:00] Labs, operations, wikitech.wikimedia.org, Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1779729 (demon) p:Triage>Normal [22:49:33] Labs, operations, wikitech.wikimedia.org, Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1779738 (Krinkle) Checking https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors shows these connection errors from RunJobs for... [23:01:22] Tool-Labs-tools-Other, Community-Tech: Edit Count tool on labs shows text in wrong language - https://phabricator.wikimedia.org/T111512#1779849 (DannyH) @Qgil, this is the ticket I mentioned to you earlier today. We aren't going to take this -- it might be appropriate for dev relations?