[00:05:21] PROBLEM - Host ToolLabs is DOWN: CRITICAL - Host not found (tools.wmflabs.org)
[00:10:27] RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[00:12:44] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0]
[00:14:42] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0]
[00:15:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0]
[02:06:27] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL 20.00% of data above the critical threshold [0.0]
[02:13:19] PROBLEM - Puppet failure on tools-static-02 is CRITICAL 22.22% of data above the critical threshold [0.0]
[02:15:35] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL 40.00% of data above the critical threshold [0.0]
[02:18:27] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 66.67% of data above the critical threshold [0.0]
[02:36:25] RECOVERY - Puppet failure on tools-webproxy-01 is OK Less than 1.00% above the threshold [0.0]
[02:43:18] RECOVERY - Puppet failure on tools-static-02 is OK Less than 1.00% above the threshold [0.0]
[02:43:26] RECOVERY - Puppet failure on tools-static-01 is OK Less than 1.00% above the threshold [0.0]
[02:45:32] RECOVERY - Puppet failure on tools-webproxy-02 is OK Less than 1.00% above the threshold [0.0]
[08:03:08] Is the service up for this: https://tools.wmflabs.org/xtools-articleinfo/index.php?article=Positional_notation&lang=en&wiki=wikipedia
[10:01:06] History page revision statistics is not working!
[10:07:15] Guest43199: yup, the service for that is up and running. application error, probably. I'll restart it
[10:09:54] Guest43199: works again now
[10:16:16] valhallasw: thoughts on https://phabricator.wikimedia.org/T98442#1335815
[10:16:38] YuviPanda: mir egal (I don't mind either way)
[10:17:20] valhallasw: ah, I'll take that as a +1 then :)
[10:28:39] Merlissimo: around?
[10:29:04] Merlissimo: I'd like your thoughts on https://phabricator.wikimedia.org/T98442#1335815
[10:29:54] YuviPanda: https://phabricator.wikimedia.org/T101215
[10:30:07] Still happening, causing cvn maintenance scripts to fail.
[10:30:25] looking
[10:30:28] Don't have time to change cvn scripts (if needed) until next week probably.
[10:31:01] hmm, I can't ssh in normally
[10:31:04] * YuviPanda tries root
[10:31:20] hmm
[10:32:40] Krinkle: hmm, you had a strange resolv.conf
[10:32:49] I fixed it manually and it works now. I'm investigating
[10:34:15] Krinkle: a puppet run fixed it
[10:35:07] interesting
[10:35:57] Jun 4 10:13:51 cvn-app5 puppet-agent[3755]: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to fetch instance ID at /etc/puppet/modules/base/manifests/init.pp:21 on node i-000002cf.eqiad.wmflabs
[10:36:03] looks like puppet wasn't able to fetch the ID either
[10:36:27] YuviPanda: Did you fix it manually and then puppet, or did puppet fix it
[10:36:31] Krinkle: former
[10:36:34] right
[10:36:50] Krinkle: aaah, I might have an explanation!
[10:36:59] which is quite terrifying for some reason
[10:37:14] Krinkle: puppet thought it was in production
[10:37:20] Krinkle: and set your resolv.conf to the prod DNS servers
[10:37:23] which won't work for you
[10:37:25] causing your problems
[10:38:55] Hm.. syslog doesn't go back far enough to see if the chicken (puppet fail) or the egg (gethostbyname) was first.
[10:39:21] I think puppet would have been first
[10:39:27] since the IPs found there match the prod stuff
[10:39:51] Krinkle: so - LDAP failure earlier might've been caught by puppet mid-run and convinced it it was in prod?
[10:40:05] ah, there's rotate of syslog, of course.
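
A quick way to untangle the chicken-and-egg question above is to check which resolvers the instance ended up with and to search the rotated syslogs, since the plain syslog had already rotated past the first failure. A minimal sketch, assuming a standard Debian/Ubuntu log layout on the instance; nothing here is specific to this incident beyond the puppet-agent error quoted at 10:35:57:

    # Which resolvers did puppet leave behind? A labs instance should point at
    # the labs recursor, not the production DNS servers.
    grep '^nameserver' /etc/resolv.conf

    # syslog rotates quickly; zgrep reads the plain file and the gzipped
    # rotations alike, so the earliest catalog failure can still be found.
    zgrep -h 'puppet-agent.*Could not retrieve catalog' /var/log/syslog /var/log/syslog.*.gz | head
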
[10:40:38] YuviPanda: Goes without saying, other instances may also be affected. This is the only one in cvn that acted up but then we only have three.
[10:40:46] Krinkle: yup :(
[10:40:53] Krinkle: let me try salt
[10:42:30] according to graphite/nagf, a bunch of other instances also have puppet failures as of ~24 hours ago
[10:42:57] yup, salt sudo hostname is causing a few errors too
[10:43:28] YuviPanda: 'hostname -f' has been my test command in salt since forever because it never fails.
[10:43:32] :D
[10:43:41] for some definition of never fails, of course :D
[10:43:44] and now it does :P
[10:44:05] I tend to use it when I'm doing magic with the salt matching because I never remember
[10:44:12] yeah
[10:44:20] soon at least salt will stop using the ec2ids
[10:44:22] (and also because salt uses the instance identifier instead of fqdn)
[10:44:30] yeah
[10:45:43] Krinkle: quite a few are failing
[10:46:56] 44 failures
[10:47:04] let me run that again
[11:02:44] 44 again
[11:02:46] not bad
[11:20:17] YuviPanda: oh, there was one reason for the tomcat starter
[11:20:24] which is/was the separate tomcat queue
[11:20:42] valhallasw: was. it's just running on the generic webgrid
[11:20:46] ok
[11:48:05] paravoid: andrewbogott_afk https://phabricator.wikimedia.org/T101377 fallout from the LDAP TLS stuff earlier. I've a proposed 'fix' but wanted to verify with either of you before doing it
[12:24:11] (PS1) Sitic: Improved logevent support, added de translations [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/215894 (https://phabricator.wikimedia.org/T100339)
[13:49:22] PROBLEM - Puppet failure on tools-redis is CRITICAL 100.00% of data above the critical threshold [0.0]
[14:21:25] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 30.00% of data above the critical threshold [0.0]
[14:27:56] valhallasw: I'm shamefully considering self-merge for that tools-webservice patch
[14:32:53] YuviPanda: what is going on with the ldap start tls issue on labs?
[14:33:06] will you guys have to manually fix up all instances?
[14:33:10] hashar: https://phabricator.wikimedia.org/T101377 is the bug
[14:33:22] hashar: I'm salting things now
[14:33:28] ah
[14:33:40] beta and integration have their own saltmasters though
[14:33:58] deployment-salt.eqiad.wmflabs and integration-saltmaster.eqiad.wmflabs
[14:34:10] can't sudo on them though :-(((
[14:35:13] what needs to be done exactly hashar?
[14:35:28] no clue :-}
[14:36:21] https://phabricator.wikimedia.org/T101377#1337162
[14:36:26] or a sed that I'm experimenting with now
[14:36:42] sounds fun
[14:41:28] hashar: you should be able to sudo on deployment-salt?
[14:41:33] It has the correct ldap.conf
[14:41:53] I just sudo'd to you and then sudo'd back to root
[14:42:05] can you apply the fix ?
[14:42:30] deployment-salt itself seems fixed.
[14:51:24] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0]
[14:52:51] YuviPanda: deployment-prep seems fine. can't sudo on integration-saltmaster to run the sed command :\
[14:53:02] thcipriani: I can do that, moment
[14:53:09] cool. thanks.
[14:53:39] thcipriani: done
[14:54:10] sudo is back on integration saltmaster :)
[14:54:32] !log integration ran sudo sed -i 's/GlobalSign_CA.pem/ca-certificates.crt/' /etc/ldap/ldap.conf on integration-saltmaster
[14:54:39] Logged the message, Master
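
The fleet-wide version of this fix went out through salt (see the !log line above and the 'hostname -f' liveness test discussed at 10:43). From a salt master it looks roughly like the sketch below; only the sed expression is taken from the log, the '*' target and timeout are illustrative:

    # First, see which minions still respond and which still have the old CA reference.
    salt --timeout=30 '*' cmd.run 'hostname -f'
    salt --timeout=30 '*' cmd.run 'grep -l GlobalSign_CA.pem /etc/ldap/ldap.conf'

    # Then push the same substitution that was run by hand on integration-saltmaster.
    salt --timeout=30 '*' cmd.run "sed -i 's/GlobalSign_CA.pem/ca-certificates.crt/' /etc/ldap/ldap.conf"
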
[14:54:44] hashar: looking at ldap on the rest of integration now.
[14:54:54] thcipriani: a bunch of us have 'root keys' to all labs instances, and can get in during ldap failures (situations like these)
[14:54:55] \O/
[14:55:44] YuviPanda: sounds nice :)
[15:03:48] hashar: spot check on a handful of integration instances post-salt run indicates that ldap seems fixed on integration now :)
[15:03:55] \O/
[15:04:18] thanks YuviPanda !
[15:06:31] Coren: if you’re around, help me debug a puppet/ldap issue? (that I suspect is not related to yesterday’s outage)
[15:07:39] andrewbogott: btw, I'm taking care of https://phabricator.wikimedia.org/T101377 in all the hosts that salt could reach. down to 1, I think
[15:07:47] thanks
[15:07:54] andrewbogott: Sure.
[15:08:28] Coren: client, testlabs-newpuppet master, labcontrol1001
[15:08:55] certs and everything are working, but it can’t get the node def from ldap
[15:09:02] I can’t tell why… ldap is reachable from both hosts
[15:09:36] I'll go look.
[15:09:44] thanks!
[15:10:31] YuviPanda: ... what time zone are you working from now? You're in Scotland iirc?
[15:10:46] moving out see you tomorrow!
[15:13:58] andrewbogott: I thought the i-* names were deprecated in the new setup?
[15:14:15] not yet, that’s a later step
[15:14:41] scheduled for a week from today.
[15:15:14] Coren: yes
[15:15:27] so some sort of vaguely scottish timezone
[15:16:09] Heh. For you that means pretty much anything between UTC-7 and UTC+9. :-) Though if you are trying to sync up with your SO that may be more restricted. :-P
[15:16:13] He says ‘the scottish timezone’ because it’s bad luck to say ‘macbeth standard time’
[15:16:27] haha
[15:16:35] * Coren gives three points to andrewbogott for the Bard reference.
[15:17:01] Coren: more 'regular' today 'coz am going to go to a lecture in the evening
[15:17:34] Coren: also thoughts on https://phabricator.wikimedia.org/T98442#1335815 (and following comments)
[15:17:41] * Coren looks
[15:19:12] (PS2) Sitic: Improved logevent support [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/215894 (https://phabricator.wikimedia.org/T100339)
[15:20:35] (PS3) Sitic: Improved logevent support [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/215894 (https://phabricator.wikimedia.org/T100339)
[15:25:28] andrewbogott: The only odd discrepancy I think I see is the node name it attempts to use to find its definition; it's explicitly looking for 'i-00000c9f.testlabs.eqiad.wmflabs', which is in ldap as an associatedDomain, but the actual dn uses i-00000c9f.eqiad.wmflabs.
[15:25:55] andrewbogott: Do you know if the ENC uses the fqdn to construct a dc=* dn or it actually searches for the associatedDomain?
[15:26:00] the behavior is the same on virt1000 though
[15:26:30] From puppet.conf: ldapstring = (&(objectclass=puppetClient)(associatedDomain=%s))
[15:26:40] Ah, hm.
[15:26:54] * Coren keeps digging.
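
The discrepancy Coren describes (the entry is findable by associatedDomain while its dn uses the shorter i-xxxx.eqiad.wmflabs form) can be checked directly against LDAP with the same filter the ENC builds from the ldapstring quoted above. A sketch; the server URI and base DN are assumptions, not values from the log:

    # Ask LDAP the same question puppet's node terminus asks. Replace the URI
    # and base DN with the real ones for the environment.
    ldapsearch -x -LLL -H ldap://ldap-server.example.org \
        -b 'ou=hosts,dc=wikimedia,dc=org' \
        '(&(objectclass=puppetClient)(associatedDomain=i-00000c9f.testlabs.eqiad.wmflabs))' \
        dn associatedDomain
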
[15:30:53] coren, I heard there is still some weirdness with admin module and labstore hosts? Is that something I can try to help with?
[15:31:10] I had thought the conflict was resolved though so I'm not sure what's up
[15:32:05] chasemp: There's a patch for it, but it currently has a -2 from faidon because it wasn't clear to him what I was trying to do at the time. He's been ridiculously busy since though. (We need to clone him a few helpers)
[15:32:21] can you point me to it?
[15:35:12] my busyness isn't the issue here though
[15:35:24] chasemp: https://gerrit.wikimedia.org/r/#/c/207514/ I think
[15:35:25] I haven't heard back, I just asked about an update on this last week
[15:36:29] (on phab)
[15:37:15] when I had looked upstream puppet had approached this issue with a "local=true" thing for accounts that matched ldap to get around getent being silly about what accounts actually exist
[15:37:28] I had not actually run through and tested but I thought that was the general direction
[15:37:51] I'm not sure if this patch is solely for the admin conflicts or if it's solving it as a byproduct and bringing things in line with prod
[15:38:16] paravoid: mark suggested that I extend the rationale on the ticket to the patch itself so that the context is clearer.
[15:38:38] paravoid: I'll do so later today after I beat LDAP up some more.
[15:40:00] Coren: is this mainly to at at the odd ldap/admin overlap?
[15:40:10] to get at I meant / solve
[15:40:15] https://phabricator.wikimedia.org/T87870 was the original task, has no updates
[15:40:38] paravoid: Ah, sorry, I updated the subtask not the supertask.
[15:40:50] this was forked off to https://phabricator.wikimedia.org/T95559 at some point, which was not originally linked
[15:41:41] paravoid: That was probably my fail when I separated the larger task to create the subtask; I must have forgotten to link them.
[15:44:54] and thanks andrewbogott
[15:46:01] andrewbogott: I've probably reached your point at debugging. I'm at the "everything works except it doesn't" stage. :-)
[15:46:50] Coren: great! Then we agree
[15:49:58] Coren: note that labvirt1001 is Trusty so we’re on untrammeled ground. Could be a required config change or an unsupported feature or… whatever.
[15:50:14] labcontrol1001 right
[15:50:46] yes, sorry
[15:51:04] my fingers have trouble typing anything but ‘virt’ after ‘lab’ these days
[15:51:08] :D
[15:51:11] so much labvirting
[15:51:18] yeah
[15:55:09] I'd really like to not have an NFS -> LDAP dependency
[15:56:23] paravoid: There is no way around it, sadly; that said, the worst that can happen with that setup is that supplemental group permissions are no longer honored (it fails safely).
[15:56:36] yeah, famous last words
[15:57:50] paravoid: Known to be true because it broke that way before. :-) But more seriously, we can get ldap out of the way of auth on the servers and bring them into the fold this way but getting rid of group information is not workable.
[15:58:14] Ah, lunch is here. BBIAB
[16:12:09] * Coren is back
[16:36:15] andrewbogott: I find nothing broken in any of the individual components. The only thing I can't rule out is some oddity in the ruby invoked by passenger? Perhaps another 1.9 oddity?
[16:36:49] Could be
[16:36:54] I just turned on some more logging...
[16:37:14] Puppet (debug): Failed to load library 'ldap' for feature 'ldap'
[16:37:18] That’s clearly it.
[16:37:36] That... seems likely. :-)
[16:38:26] of course that is /all/ it says
[16:39:10] "An error has occurred. Please consult your system administrator."
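
The debug line above ("Failed to load library 'ldap' for feature 'ldap'") generally means the ruby interpreter serving the puppet master cannot load the ruby LDAP bindings, which disables puppet's LDAP node terminus. A quick sanity check, sketched under the assumption that the bindings come from the distro's ruby-ldap package on Trusty, with the caveat that passenger may be running a different ruby than the one on the command line:

    # Can this ruby load the library behind puppet's 'ldap' feature?
    ruby -e "require 'ldap'; puts 'ldap bindings load fine'"

    # If not, check whether the bindings are installed at all.
    # (package name assumed for Ubuntu Trusty)
    apt-cache policy ruby-ldap
    dpkg -l | grep -i ruby | grep -i ldap
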
[16:41:42] Coren, moving to operations because alex is there
[17:15:51] (PS2) Ricordisamoa: Remove duplicate title and unused `s_text` argument [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/211055
[17:17:21] (CR) Sitic: [C: 2 V: 2] Improved logevent support [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/215894 (https://phabricator.wikimedia.org/T100339) (owner: Sitic)
[18:28:18] Quarry: Quarry's indentation function is not completely functional - https://phabricator.wikimedia.org/T101424#1338373 (Huji) NEW
[18:33:42] Quarry: Quarry's indentation function is not completely functional - https://phabricator.wikimedia.org/T101424#1338395 (Huji)
[19:19:10] any tools admins around? one of my web tools needs a memory limit increased...
[19:33:50] Quarry: Quarry should show the results in the way they were ordered - https://phabricator.wikimedia.org/T101396#1338583 (Umherirrender)
[19:33:54] Quarry: Quarry does not respect ORDER BY sort order in result set - https://phabricator.wikimedia.org/T87829#1338584 (Umherirrender)
[19:43:24] YuviPanda: ping
[20:01:47] YuviPanda: Quick +1 to https://gerrit.wikimedia.org/r/#/c/215918/4 now that the old crap has been axed? I'd like to be able to close this.
[20:16:58] Coren: did you solve https://phabricator.wikimedia.org/T84032 ?
[20:17:53] Coren: consider it a virtual +1? Am out :)
[20:18:02] (If the dead hosts have all been removed)
[20:22:51] They're dead, Jim.
[20:24:09] Earwig: open a bug? I'll get to it when I'm back at laptop
[20:24:16] sure
[20:24:32] Also, can do Earwig. Point me at the bug once filed and I'll handle it.
[20:28:22] !log deployment-prep upgrading hhvm-fss from 1.1.4 to 1.1.5, has fix for T101395
[20:29:52] Logged the message, Master
[20:30:37] Tool-Labs: Raise memory limit for copyvios web tool - https://phabricator.wikimedia.org/T101437#1338754 (Earwig) NEW
[20:30:41] Coren: ^
[20:31:55] Earwig: Got it. Ima raise to 6, but if you're hitting that wall again anytime soon we'll probably have to dig deeper to figure out where that vmem is going.
[20:32:04] yeah, that sounds good
[20:33:43] Tool-Labs: Raise memory limit for copyvios web tool - https://phabricator.wikimedia.org/T101437#1338777 (coren) Open>Resolved a:coren Raised to 6g.
[20:33:52] {{done}}. You'll have to restart your webservice for it to stick.
[20:34:28] it's up, thanks
[20:58:28] Cpiral
[20:59:55] "Topedits" is not working: https://tools.wmflabs.org/xtools/topedits/?user=Cpiral&lang=en&wiki=wikipedia&page=Positional_notation
[21:05:41] MusikAnimal: ^
[21:06:42] trying a manual reboot
[21:14:02] I hope that's not your password Guest6582
[21:14:33] oh, is that a user?
[21:14:35] ok
[21:14:47] It's my user name on Wikipedia.
[21:17:30] okay
[21:17:37] should be back up!
[21:39:14] hmm i'm getting "failed to create instance". it worked moments ago, but just failed twice in a row.
[21:43:10] over quota?
[21:43:11] perhaps, i had a lot of instances in my project. just deleted a few, retrying..
[21:43:11] yep, that worked. thanks.
[21:43:55] that error message could be more helpful
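
The instance-creation failure just above turned out to be a quota limit hidden behind a generic error. For anyone with OpenStack CLI credentials for the project this is quick to confirm; a sketch, noting that ordinary labs users at the time would instead check the instance list in the wikitech interface, and that the project name is a placeholder:

    # How many instances does the project already have, and what is it allowed?
    openstack quota show myproject
    openstack server list -f value -c Name | wc -l
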
[22:00:43] Labs: Convert all ldap globals into hiera variables instead - https://phabricator.wikimedia.org/T101447#1339137 (yuvipanda) NEW
[22:01:30] chasemp: ^ I filed that bug
[22:20:14] (PS1) Sitic: Add allrev fallback [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/215988
[22:20:34] (CR) Sitic: [C: 2 V: 2] Add allrev fallback [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/215988 (owner: Sitic)
[22:35:19] Tool-Labs, Labs-Sprint-100: setup host-based auth for tools hosts properly - https://phabricator.wikimedia.org/T98714#1339336 (yuvipanda) Open>Resolved a:yuvipanda Done now. Can ssh from bastions to exec / other hosts, but not to 'infrastructure' hosts (webproxy, for example)
[22:46:19] Labs, ContentTranslation-cxserver, Services, operations: LDAP TLS failing on some instances due to inconsistent state - https://phabricator.wikimedia.org/T101377#1339388 (yuvipanda) I salted the sed on most machines earlier, and they all seem ok now.
[22:54:06] Labs, ContentTranslation-cxserver, Services, operations: LDAP TLS failing on some instances due to inconsistent state - https://phabricator.wikimedia.org/T101377#1339418 (faidon) FWIW, I ran a sed yesterday via salt across the Labs fleet (in fact, the same salt command as you did). I don't know why...
[22:57:45] Labs, ContentTranslation-cxserver, Services, operations: LDAP TLS failing on some instances due to inconsistent state - https://phabricator.wikimedia.org/T101377#1339425 (yuvipanda) Salt issues possibly, also puppet might've set them back if it was in an inconsistent state?
[23:30:14] Can I make a cron such that it will email me the output if there is output?
[23:32:36] Labs: Convert all ldap globals into hiera variables instead - https://phabricator.wikimedia.org/T101447#1339600 (scfc)
[23:38:13] a930913: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Mail_from_tools
[23:39:57] sitic: Yeah, that's what I'm trying to avoid.
[23:41:33] sitic: Cron would usually do it, but because all crons are jsubbed, the jsub value is the cron return.
[23:43:34] a930913: you'll need to jsub a bash script which calls the program and pipes the output to mail/exim
[23:44:04] sitic: I worried as much D:
[23:45:06] Simple script turned overkill D:
[23:45:38] I'm going to spend more time getting it to email than the script spends checking what it's checking.
[23:46:09] a930913: https://dpaste.de/C9Sn should do it I think
[23:46:36] eh exim instead of mail
[23:48:51] Don't want another wrapping script though D:
[23:49:56] haha I know the problem :-/
[23:53:26] Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1339734 (Teslaton) 24+ hour outage, can anyone have a look and resume normal op, if possible? http://tools.freeside.sk/monitor/http-kmlexport.html ToolLabs httpd itself seems to be healthy.
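
For the cron question a930913 asks at 23:30 (only get mail when the job actually produces output), the wrapper approach sitic describes (jsub a bash script that pipes any output to mail/exim) looks roughly like the sketch below. The script path, tool name and recipient are placeholders, and the sendmail path assumes the usual exim compatibility shim on the grid nodes:

    #!/bin/bash
    # Wrapper submitted from the tool's crontab, e.g.:
    #   0 * * * * jsub -once -quiet /data/project/mytool/check-wrapper.sh
    # (tool name and paths are placeholders, not from the log)
    out=$(/data/project/mytool/check.sh 2>&1)
    if [ -n "$out" ]; then
        # Only send mail when there is something to report.
        printf 'To: tools.mytool@tools.wmflabs.org\nSubject: check.sh output\n\n%s\n' "$out" \
            | /usr/sbin/sendmail -t
    fi
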