[00:18:13] 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org: Wikitech often loses track of internal openstack/nova session - https://phabricator.wikimedia.org/T101199#1332480 (10Krinkle) 3NEW [00:18:46] 6Labs, 10Labs-Infrastructure: labsconsole: Empty instance list - https://phabricator.wikimedia.org/T73731#1332493 (10Krinkle) [00:18:51] 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org: Wikitech often loses track of internal openstack/nova session - https://phabricator.wikimedia.org/T101199#1332496 (10Krinkle) [00:19:17] 6Labs, 10wikitech.wikimedia.org: Unable to see or delete existing web proxy - https://phabricator.wikimedia.org/T90391#1332498 (10Krinkle) 5Open>3Resolved a:3Krinkle [01:51:49] I applied for an OAuth token about a week and a half ago on Mediawiki.org and I haven't heard anything. Am I doing it wrong? [03:24:15] Magog_the_Ogre: yeah, you need to bug someone with staff rights on IRC :P [03:24:53] thanks legoktm [03:25:39] 6Labs, 10Labs-Infrastructure, 7Regression: `hostname -f` on cvn-app5.eqiad.wmflabs returning "error: Name or service not known" - https://phabricator.wikimedia.org/T101215#1332818 (10Krinkle) 3NEW [03:50:06] (03PS2) 10Mattflaschen: basic setup for the application [labs/tools/flow-oauth-demo] - 10https://gerrit.wikimedia.org/r/213590 (https://phabricator.wikimedia.org/T101217) (owner: 10Rjain) [06:33:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 70.00% of data above the critical threshold [0.0] [06:58:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [07:38:59] 6Labs, 5Patch-For-Review, 7database: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1333065 (10jcrespo) a:5jcrespo>3None [08:54:31] YuviPanda: *poke* :) [08:54:37] hi addshore [08:54:38] 'sup [08:54:54] hows the cold north of the UK? :) [08:55:04] addshore: VERY [08:55:09] Also, any idea which files I should poke for https://phabricator.wikimedia.org/T100885 ? [08:55:31] loooking [08:55:57] addshore: hmm, apergos would know, I think. it's an nfs mount on teh dataset hosts, let me poke around [08:55:59] I would guess in https://github.com/wikimedia/operations-puppet/tree/production/modules/dataset/files/labs ? [08:56:08] kk :) [08:56:37] I mean, they are currently in /data/scratch wikibase and wikidata [08:56:41] addshore: looks like https://github.com/wikimedia/operations-puppet/blob/production/modules/dataset/files/labs/labs-rsync-cron.sh [08:56:43] but best make things consistent and easy :P [08:56:56] addshore: yes, and /public is in a different mount with less outages :) [08:57:01] xD [08:58:57] 10Tool-Labs, 3Labs-Sprint-100: Move toollabs to designate - https://phabricator.wikimedia.org/T100023#1303692 (10yuvipanda) This kind of blew up and caused a gridengine outage, but is all done now. Awaiting more details + incident report from @COren. [09:05:04] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, 3Wikidata-Sprint-2015-06-02: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1333173 (10Addshore) a:3Addshore [09:06:52] awesome, patch up! 
;) [09:18:46] !log wdq-mm deleted wdq-mm-02 [09:18:51] Logged the message, Master [09:41:15] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, and 2 others: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1333206 (10Addshore) So everyone watching this ticket has some idea of a timescale for this! ``` 10:06 AM (PS1) A... [10:28:06] 6Labs: Upgrade postgres on labsdb1004 / 1005 to 9.4 - https://phabricator.wikimedia.org/T101233#1333262 (10yuvipanda) 3NEW [10:32:44] 6Labs: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1333272 (10Yurik) [10:52:12] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1333304 (10yuvipanda) Recreated it, but I wdq-mm doesn't start yet. I guess the package needs to be updated and the data file copied over. [11:09:59] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1333322 (10Magnus) On wdq-mm-01, the file is /srv/wdq/latest.wdq I created that directory on 02, but can't scp the file over from 01 for some reason. [11:10:47] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1333323 (10yuvipanda) I usually just cp the file to /data/scratch and copy it back the other side :D [11:23:10] 10Tool-Labs-tools-Other, 10Phragile, 6TCB-Team: Deploy Phragile on tool-labs - https://phabricator.wikimedia.org/T100192#1333343 (10Tobi_WMDE_SW) It is deployed on http://phragile.wmflabs.org now but not jet puppetized. I'm closing this task now and there is a separate one for Puppet: T101235 [11:34:53] 6Labs: milimetric and halfak would like postgresql database access - https://phabricator.wikimedia.org/T91267#1333410 (10yuvipanda) 5Open>3Resolved halfak has access for a while now. [11:47:41] 6Labs: Cannot ssh into wdq-mm-02.eqiad.wmflabs - https://phabricator.wikimedia.org/T101102#1333486 (10Magnus) Thanks, that worked. File is copied. But the service still won't run. :-( [12:02:32] 6Labs, 6operations: labvirt1005 doesn't boot up - https://phabricator.wikimedia.org/T100030#1333509 (10faidon) What's the status of this? Is it blocked on someone outside the Labs team? [12:09:12] 6Labs, 6operations: labvirt1005 doesn't boot up - https://phabricator.wikimedia.org/T100030#1333520 (10yuvipanda) Ugh, this fell through the cracks :| Ideally, someone will investigate ways to get this machine booting up on a kernel that's new enough to not have the memory issues that @bblack pointed out - an... [12:17:14] 6Labs, 6operations, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1333536 (10yuvipanda) T100030 is related [12:21:30] 6Labs, 6operations: labvirt1005 doesn't boot up - https://phabricator.wikimedia.org/T100030#1333543 (10yuvipanda) I've asked for help in the ops@ list again. [12:28:24] 6Labs, 6operations: labvirt1005 doesn't boot up - https://phabricator.wikimedia.org/T100030#1333554 (10yuvipanda) @andrew says that similar issues had cropped up in another machine before, and a rollback to an older kernel fixed it. [12:47:14] YuviPanda: regarding the boot problems of virt* [12:47:21] what is the OS ? [12:47:25] ubuntu ? 
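The /data/scratch relay mentioned at 11:10 above is a two-hop copy through the shared scratch mount, useful when a direct scp between instances fails. A minimal sketch, using the file and paths named in that exchange:
```
# On wdq-mm-01: stage the data file on the shared scratch mount
cp /srv/wdq/latest.wdq /data/scratch/

# On wdq-mm-02: copy it from the shared mount into place
mkdir -p /srv/wdq
cp /data/scratch/latest.wdq /srv/wdq/
```
The relay works because /data/scratch is NFS-mounted on both instances, unlike instance-local storage.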
[13:20:53] 6Labs, 10Labs-Infrastructure, 6operations, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1333661 (10coren) The copy is progressing nicely (if not as fast as hoped); the bottleneck appears to be the ssh channel window size... [13:45:05] 6Labs, 5Patch-For-Review, 7database: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1333697 (10Andrew) Is this done, or still pending the backup work? [14:14:16] 6Labs, 5Patch-For-Review, 7database: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1333782 (10jcrespo) @Andrew Blocked by the above patches (backups). [14:15:55] 6Labs, 6operations: labvirt1005 doesn't boot up - https://phabricator.wikimedia.org/T100030#1333783 (10BBlack) The memory issues were just a random guess, not real evidence. I do think getting on newer kernels is probably a win in general, though. The alerts about not finding disks.... is this generic to all... [14:53:04] andrewbogott: YuviPanda: Anything to add to http://etherpad.wikimedia.org/p/Labs-20150602 [14:53:27] Not sure what I could put in Actionables though. [14:54:17] ‘fix gridengine’ :( [14:55:13] Well yeah, having the alias bypass the check is arguably a bug in gridengine but in fairness it's not /supposed/ to support renaming exec nodes at all in the first place. [14:55:35] Yeah, I guess renaming a running server isn’t exactly standard practice. [14:55:55] The "correct" thing to do would have been to create new nodes with the new naming scheme; we avoided that because disruptive outage for running jobs (which we did avoid) [14:57:06] "Host name will not change while the system is live" is not an entirely unreasonable presumption. :-) [14:57:35] yeah [14:57:59] I added a few actionable [14:57:59] s [14:59:10] andrewbogott: just for some impact feeling: https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Matanya&ilshowall=1 <- those videos were produced by the video project in labs, thank you! [14:59:29] YuviPanda: That second one would be an interesting exercise worthy of writing a whole book, not a recipe. I don't think it's plausible to write a tutorial on "how to do time recovery of a BDB including manually editing a dump for newbies" :-) [14:59:41] matanya: lots! [15:00:47] YuviPanda: Beyond "have had to recover broken BDB before lots in panicky situations and learn to read and edit the outbut of db_dump and friends" :-) [15:20:27] 6Labs, 10Labs-Infrastructure: missing database entries at categorylinks table on dewiki db - https://phabricator.wikimedia.org/T72711#1334028 (10Merl) And currently again: ``` $ mysql -hs5.labsdb -vvve "select page_id, page_latest, cl_to from dewiki_p.page left join dewiki_p.categorylinks on page_id=cl_from wh... [15:23:39] 6Labs, 10Labs-Infrastructure: missing database entries at categorylinks table on dewiki db - https://phabricator.wikimedia.org/T72711#1334042 (10coren) @merl: That query produces exactly the same result in production - whatever the issue you expect may be, it is not related to replication. 
[15:25:55] (03PS1) 10Sitic: logevent support [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/215639 [15:26:14] Coren: i am expecting the categories also shown at the bottom of http://de.wikipedia.org/w/index.php?title=Leopold_Mozart&oldid=142745685 [15:26:20] (03CR) 10Sitic: [C: 032 V: 032] logevent support [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/215639 (owner: 10Sitic) [15:27:10] to which project must the task moved then? [15:27:15] Merlissimo: That may be an issue related to the job runners, but I can tell you that production databases give exactly that result for that query. [15:27:37] Merlissimo: I *think* there is a related ticket already. Give me a minute. [15:30:31] Merlissimo: I can't find a specific bug. I might suggest MediaWiki-JobQueue as a first attempt, if it's not queue related, they're more likely to know where the issue is. Definitely not DB though [15:34:11] how long will the dns changes last? [15:34:48] doctaxon: They're intended to be permanent. [15:35:15] permanent until when? [15:36:23] Wait, I'm not sure I understand the context to your question, then. [15:38:22] My sessions on bastion, trusty and login shut down every minute [15:39:25] Hm. That has nothing to do with DNS (I expect you thought so because of the /topic which refers to something else and which is, in fact, done) [15:39:30] with connection abort [15:39:42] doctaxon: Lemme see if I can see why in the logs. [15:40:24] doctaxon: What is you username? [15:41:33] Coren - Can you access to the logs? [15:41:50] the bot username is taxonbot [15:44:32] doctaxon: I see nothing in the log except the connection closing on your end; nor do I see anyone else with suspiciously brief sessions. Perhaps there is a network issue on your side? Have you tried to connect elsewhere? [15:45:24] There is no possibility to connect elsewhere right now [15:46:16] You may try one of the general bastions. (bastion.wmflabs.org) [15:54:48] (03PS1) 10Ricordisamoa: Initial commit [labs/tools/translatemplate] - 10https://gerrit.wikimedia.org/r/215644 [15:55:06] Coren - general bastion does not support become [15:55:23] Well, no, this was only about testing your ssh issues. :-) [16:04:55] (03PS2) 10Ricordisamoa: Initial commit [labs/tools/translatemplate] - 10https://gerrit.wikimedia.org/r/215644 [16:14:24] 6Labs, 6Analytics-Kanban: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole} - https://phabricator.wikimedia.org/T76075#1334179 (10kevinator) 5Open>3declined a:3kevinator I'm closing this task because it is very broad and there are no clear next steps to... [16:38:12] 10Tool-Labs: Install "internetarchive" python module - https://phabricator.wikimedia.org/T100977#1334291 (10Nemo_bis) I don't understand, was the package actually installed? [16:49:09] 6Labs: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1334317 (10Yurik) [16:49:29] 10Tool-Labs: Install "internetarchive" python module - https://phabricator.wikimedia.org/T100977#1334320 (10yuvipanda) 5Resolved>3declined [16:50:13] 10Tool-Labs: Install "internetarchive" python module - https://phabricator.wikimedia.org/T100977#1325330 (10yuvipanda) That's better :) [17:11:00] thcipriani: how did staging hold up through the change? [17:11:58] so, changed the use_dnsmasq=false, which changed /etc/resolv.conf to use 208.80.154.12 as a nameserver, then it seems, it refuses to connect to ldap [17:12:15] .12? 
[17:12:19] It should be .20 I think [17:12:22] yeah, seems like it should be .20 [17:12:28] Sounds like your puppet repo is out of date [17:12:39] ah, lemme check that [17:13:06] I think there’s a class you can set that will automatically rebase on a cron. I don’t know how well it works though [17:16:52] andrewbogott: staging was indeed behind, rebased re-running puppet now [17:17:11] great. Should help, although you might have to hand-tune resolv.conf to get puppet runs [17:18:29] kk, so it is .20 now, and I can still hit ldap, so hooray :) [17:18:49] cool [17:20:20] andrewbogott: so, it seems, the master certname has updated in puppet.conf on staging-palladium, but not the agent certname. Even on subsequent runs. [17:21:01] Ok, let me have a look... [17:23:57] thcipriani: that’s the puppetmaster, right? [17:24:01] right [17:26:28] where is the name of the puppetmaster defined? [17:26:42] also, fun thing, since I set use_dnsmasq to false, and the nameserver was incorrect for whatever reason, the other instances in that project can connect up to the puppet master to receive the update and (with no ldap) I have no sudo privileges to update :( [17:27:01] s/can connect/can't connect/ [17:27:19] 6Labs, 10Labs-Infrastructure: unstable puppet runs on holmium - https://phabricator.wikimedia.org/T101281#1334458 (10Andrew) 3NEW a:3Andrew [17:27:21] andrewbogott: you mean in /etc/puppet/puppet.conf? [17:27:42] thcipriani: I mean, where is the puppet setting that tells the clients what the master is named. [17:27:48] I’m confused by hiera vs. ldap [17:28:28] ah, so it should be in hiera in the case of staging [17:28:42] it's definied in role::puppet::self as an argument ot that class [17:29:01] so if it's not defined in hiera, then it uses the top level variable retrieved from ldap [17:29:32] * andrewbogott looks at puppet code [17:30:05] 10Tool-Labs: Install "internetarchive" python module - https://phabricator.wikimedia.org/T100977#1334470 (10Sitic) @yuvipanda Nemos problem is actually that trusty has no pip installed (precise had): https://wikitech.wikimedia.org/wiki/Help_talk:Tool_Labs/Python_application_stub (virtualenv may be a bit overwhe... [17:32:06] thcipriani: I am going to rm -rf /etc/puppet/puppet.conf.d/10-self.conf to force it to regenerate. [17:32:41] kk [17:33:33] 10Tool-Labs: Install "internetarchive" python module - https://phabricator.wikimedia.org/T100977#1334483 (10Nemo_bis) Thanks both. >>! In T100977#1334470, @Sitic wrote: > @Nemo_bis Yuvi is basically asking you to run: > > ``` > virtualenv ~/env > source $HOME/env/bin/activate > echo "source $HOME/env/bin/activ... [17:33:43] ok, looks the same as before. To you too? [17:33:58] certname is i-0000094c.staging.eqiad.wmflabs which seems right to me [17:35:38] okie doke. Looks like the puppet master also generated private keys for that name at some point [17:37:17] 6Labs, 10Labs-Infrastructure: unstable puppet runs on holmium - https://phabricator.wikimedia.org/T101281#1334495 (10yuvipanda) This might be causing ocassional DNS outages, evident from some gridengine failures and some diamond error messages [17:38:28] andrewbogott: so the name staging-palladium.eqiad.wmflabs still resolves, will that always be true/should that be the case? [17:40:25] yes, for the near (and possibly distant) future the new dns maintains those names as well as the more correct .project.eqiad.wmflabs names. [17:40:45] It’s sort of bad practice to rely on them since they can be hijacked, but… that’s a memo for another day. 
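Pulling together the steps discussed above and applied later in this conversation (rebase the puppet repo, hand-tune the resolver, regenerate the agent certificate), a hedged sketch of the client-side recovery; the resolver IP and master name are the ones quoted here, and the ssl path assumes a stock Puppet 3 layout:
```
# On a stuck client: point the resolver at the new recursor by hand
# (normally puppet-managed, which is exactly what cannot run yet).
sudo sed -i 's/^nameserver .*/nameserver 208.80.154.20/' /etc/resolv.conf

# Drop the client's old certificates so a fresh one is requested.
sudo rm -rf /var/lib/puppet/ssl

# On the self-hosted puppetmaster, clean the stale client cert first, e.g.:
#   sudo puppet cert clean <client-fqdn>

# Then re-run the agent against the (still-resolvable) master name.
sudo puppet agent --test --server staging-palladium.eqiad.wmflabs
```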
[17:41:09] kk, just wondering if the salt minion conf would have to change [17:41:31] doesn't seem like it _has_ to, but probably should [17:41:46] andrewbogott: yes, I was hoping to start naming new toollabs hosts just 'exec-01' instead of 'tools-exec-01' but that's still problematic [17:42:05] thcipriani: yeah [17:42:11] So are you getting good puppet runs on clients now? [17:42:56] checking... [17:45:20] 6Labs, 7Blocked-on-Operations: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1334561 (10Yurik) [17:45:31] 6Labs, 10Maps, 7Blocked-on-Operations: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1333262 (10Yurik) [17:45:48] some are still in the state where they can't contact ldap :\ others are complaining about the cert from puppet master not matching. [17:45:54] others are fine [17:48:03] thcipriani: ok… you saw the step where you have to rename the puppetmaster on the client in puppet.conf? [17:49:43] missed that, just tried it, problem now is the old dns can't resolv the new hostname. [17:49:58] 10Tool-Labs: Install "internetarchive" python module - https://phabricator.wikimedia.org/T100977#1334616 (10Sitic) I've added it to https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#My_tool_requires_a_package_that_is_not_currently_installed_in_Tool_Labs._How_can_I_add_it.3F I'm not sure what to do with https://... [17:50:32] andrewbogott: I'm on staging-sca01, FYI [17:50:46] hm, I wonder why that didn’t happen to me in my tests [17:51:03] I wonder if we could just delete the certs on the client instead of changing the hostname in puppet.conf [17:51:19] yes, that should be fine. [17:51:52] but, then, will it grab the cert for the old hostname again? Well, let's find out. [17:52:25] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: unstable puppet runs on holmium - https://phabricator.wikimedia.org/T101281#1334628 (10Andrew) https://gerrit.wikimedia.org/r/#/c/215673/ fixed the main problem. runs are still unstable due to ferm rules. [17:54:27] 10Tool-Labs: Install "internetarchive" python module - https://phabricator.wikimedia.org/T100977#1334657 (10yuvipanda) https://merlijn.vandeen.nl/2015/flask-mwoauth-on-tools.html has good info on setting up a flask app. [17:56:34] andrewbogott: hmm, so removed certs on client, cleaned the cert from master, requested new cert and it seems to have fetched the old one :\ [17:56:50] Server hostname 'staging-palladium.eqiad.wmflabs' did not match server certificate; [17:57:24] * andrewbogott will log in and look [17:57:52] maybe just manually wrangling resolv.conf or adding the master to /etc/hosts and renaming on the client [17:58:16] easiest things I can think of off the top of my head :\ [17:58:54] because the wrong dns ip got sent out everywhere… we’re in unexplored territory [17:59:53] right, this did resolve itself on other machines, oddly [18:01:14] Hey, there is a big issue, Can you install requests library in labs (both on tools and grid engine) [18:01:19] andrewbogott: ^ [18:01:33] pywikibot now is using requests [18:01:45] thcipriani: well, also puppet doesn’t run cleanly on your clients. [18:01:51] I mean, it doesn’t compile. [18:01:57] So nothing much is going to happen on those systems :( [18:02:04] have a look on staging-sca01 [18:02:17] Amir1: hi. you should use a virtualenv. 
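A minimal sketch of the per-tool virtualenv being recommended here, assuming a tool account on a Trusty host and requests as the dependency in question; the paths and the job command are illustrative:
```
# One-time setup inside the tool account (virtualenv bundles its own pip)
virtualenv ~/venv
source ~/venv/bin/activate
pip install requests

# Jobs then run against the venv's interpreter, e.g.:
#   jsub ~/venv/bin/python ~/mybot/bot.py
```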
[18:02:36] hey [18:03:06] It's possible but it would make things easier for people [18:03:08] hmm, bitrot of some of these patches, but the fact that it's contacting the server is good enough in that instance [18:03:33] Amir1: not in the long term no. with a virtualenv, you control all the depdencies yourselves. [18:03:37] Amir1: I was worried about that too when I saw the sudden switch, but requests seems to be installed (at least the exec nodes) [18:03:39] Amir1: also, requests is already installed anyway. [18:03:58] but in the future (we should write this down somewhere), I think we should be encouraging more people to use virtualenvs [18:04:15] hmm [18:04:16] thcipriani: so, I just did what you suggested… pasted the new resolver ip into resolv.conf and edited the server name. [18:04:16] ok [18:04:29] Amir1: in the case of pywikibot, we could probably provide a venv. Not completely sure. [18:04:44] also, requests has the complete mess there's various versions used [18:04:48] andrewbogott: kk, noted if we run into it with deployment-prep. [18:05:08] yes, and global packages have the problem that some people depend on one version and others on another [18:05:09] YuviPanda: is there a reason to prefer virtualenv and not pip install --user? (except having several virtualenvs) [18:05:39] sitic: vitualenvs are per-app [18:05:45] sitic: and dependencies are per-app, so that makes more sense [18:06:08] YuviPanda or andrewbogott: I'm having problems adding a user to the suggestbot project through wikitech. Is that a known issue? [18:06:08] One noteworthy thing before moving into deployment prep: I think the ldap-yaml-enc.py update has to merge first [18:06:11] valhallasw: I mean that pip is not installed in trusty, so that you have to use virtualenv [18:06:18] Nettrom: have you tried logging out and logging back in? [18:06:26] sitic: that's just ubuntu retardedness [18:06:33] ok [18:06:40] idk, the old tools-login someone might have bypassed puppet and installed pip? [18:06:57] I think python-pip should just be installed [18:07:02] YuviPanda: ok, have you checked the issue about dump reading? [18:07:06] YuviPanda: yeah, didn't help. the user I'm trying to add isn't in the dropdown list [18:07:07] It's a big blocker for me [18:07:21] Amir1: can you file a bug? [18:07:28] I already did [18:07:43] YuviPanda: is it the case that labs ldap info now comes from that ldap-yaml-enc.py script? [18:07:43] alright. I haven't gotten around to be able to look at it yet, sadly :( [18:07:46] Nettrom: um… drop down list? Sounds like you’re trying to add an admin [18:07:48] rather than a normal user [18:07:54] thcipriani: only for staging, actually. [18:08:06] andrewbogott: so I should not be doing this from https://wikitech.wikimedia.org/w/index.php?title=Special:NovaServiceGroup&action=managemembers&projectname=tools&servicegroupname=tools.suggestbot ? [18:08:11] YuviPanda: https://phabricator.wikimedia.org/T100227 [18:08:15] ah, ok, I can see the script in puppet.conf on other instances, but wasn't sure what it's role was [18:08:17] valhallasw: it's going to error for people installing global scripts anyway, I guess. I don't mind getting it installed. [18:08:24] Nettrom: there are two ‘add user’ links. One adds them to a role, the one to the right adds them to the project... [18:08:36] s/add user/add member/ [18:08:41] Nettrom: see what I mean? 
[18:09:02] Whoah, wait, you’re talking about service groups and not projects… [18:09:08] * andrewbogott pretty confused [18:09:12] andrewbogott: yeah, sorry [18:09:39] what user are you trying to add? [18:09:46] andrewbogott: kjschiroo [18:10:37] 10Tool-Labs: make https://dumps.wikimedia.org/other/wikidata/ available on tool labs - https://phabricator.wikimedia.org/T98655#1334758 (10Sitic) [18:10:40] 6Labs, 10Datasets-General-or-Unknown, 10Labs-Infrastructure, 10Wikidata, and 2 others: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1334759 (10Sitic) [18:11:20] Nettrom: it… works fine for me. You’re seeing a million names in the drop-down but not that one? [18:11:53] andrewbogott: exactly! [18:12:31] Nettrom: well, I added them. I can’t explain what you’re seeing. [18:13:02] andrewbogott: I couldn't figure it out either, since his user talk page indicated he's already got all the access needed [18:13:21] andrewbogott: thanks for the help! [18:15:22] YuviPanda: looking at deployment_salt puppet.conf, somehow node_terminus is exec and not ldap. I may be misunderstanding something... [18:15:53] thcipriani: oh, is deployment-salt is also using ENC? [18:16:04] I _think_ so [18:17:29] yes, it is: https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep [18:17:45] role::puppet::self::enc: yaml+ldap [18:18:12] same with Integration: https://wikitech.wikimedia.org/wiki/Hiera:Integration [18:19:07] which, if we're sticking with that, we'll need to merge https://gerrit.wikimedia.org/r/#/c/202790/ before the new dns rolls [18:19:07] thcipriani: ugh, sorry, in like, 4 different conversations at once :) [18:19:28] thcipriani: am merging that now [18:19:45] YuviPanda: kk, thanks [18:19:59] thcipriani: has staging been swiched already? [18:20:04] YuviPanda: yup [18:20:50] thcipriani: done [18:20:52] andrewbogott: that enc change will have to roll out before we can switch use_dnsmasq to false on any projects using node_terminus: exec in puppet [18:20:56] YuviPanda: thanks :) [18:21:13] thcipriani: ah, true. [18:21:25] But step 1) update puppet will roll it out, right? [18:21:39] there's also only 3 projects that use it I think [18:21:58] yeah, puppet update should roll it, I think [18:22:29] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: unstable puppet runs on holmium - https://phabricator.wikimedia.org/T101281#1334852 (10Andrew) 5Open>3Resolved [18:23:27] andrewbogott: ok, ready to break deployment-prep? :) [18:24:04] thcipriani: yep! No time like the present. [18:24:20] alright, jumping on deployment salt to update puppet [18:24:26] logging in -releng [18:30:12] 10Tool-Labs, 10Incident-20150602-gridengine-dns-failure: Tools: puppetize the alias_hosts workaround for mismatching DNS node names - https://phabricator.wikimedia.org/T101296#1334896 (10coren) 3NEW a:3coren [18:32:15] ok, deployment-salt puppet updated, running once manually before changing use_dnsmasq [18:49:03] YuviPanda: you told me about release and how I can use it in jsub but I forgot :P [18:50:36] Amir1: ah. 
-l release=trusty [18:50:53] valhallasw: ^ we should make that the default for tools that don't already have a job running in precise at some point [18:50:55] not sure how to best do that [18:57:03] thanks :) [18:57:34] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 50.00% of data above the critical threshold [0.0] [18:58:08] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 22.22% of data above the critical threshold [0.0] [18:59:32] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 30.00% of data above the critical threshold [0.0] [18:59:42] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 50.00% of data above the critical threshold [0.0] [18:59:45] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 60.00% of data above the critical threshold [0.0] [18:59:45] PROBLEM - Puppet failure on tools-trusty is CRITICAL 50.00% of data above the critical threshold [0.0] [19:00:11] PROBLEM - Puppet failure on tools-redis-02 is CRITICAL 66.67% of data above the critical threshold [0.0] [19:00:11] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 66.67% of data above the critical threshold [0.0] [19:00:11] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:00:23] andrewbogott: ^ this is us and not you, do not worry :) [19:01:19] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:01:29] YuviPanda: thanks :) [19:02:19] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:02:41] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:03:07] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:03:23] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:03:41] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:04:13] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 55.56% of data above the critical threshold [0.0] [19:05:46] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:05:52] PROBLEM - Puppet failure on tools-mail is CRITICAL 30.00% of data above the critical threshold [0.0] [19:06:22] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 22.22% of data above the critical threshold [0.0] [19:06:30] PROBLEM - Puppet failure on tools-master is CRITICAL 20.00% of data above the critical threshold [0.0] [19:06:42] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:06:46] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:07:06] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 33.33% of data above the critical threshold [0.0] [19:07:18] PROBLEM - Puppet failure on tools-shadow is CRITICAL 40.00% of data above the critical threshold [0.0] [19:07:36] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 30.00% of data above the critical threshold [0.0] [19:08:02] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:08:47] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:08:57] PROBLEM - Puppet 
failure on tools-exec-catscan is CRITICAL 60.00% of data above the critical threshold [0.0] [19:09:01] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:09:13] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 66.67% of data above the critical threshold [0.0] [19:09:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:09:33] YuviPanda: not sure how to do that without breaking stuff [19:09:35] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 22.22% of data above the critical threshold [0.0] [19:09:42] valhallasw: yeah [19:09:49] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 20.00% of data above the critical threshold [0.0] [19:09:58] YuviPanda: I think we should move to the 'set in manifest, start/stop with webservice' model, and then we can just change all manifests [19:10:04] valhallasw: +1 [19:10:31] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:10:45] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:10:50] valhallasw: that should be part of the 'one script runs as the user and does things' that you mentioned for tools-manifest [19:10:53] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 60.00% of data above the critical threshold [0.0] [19:10:53] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:10:58] *nod* [19:11:01] PROBLEM - Puppet failure on tools-services-01 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:11:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:11:35] valhallasw: also, https://gerrit.wikimedia.org/r/#/c/215505/ :) [19:11:57] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:11:59] PROBLEM - Puppet failure on tools-redis-01 is CRITICAL 30.00% of data above the critical threshold [0.0] [19:12:05] YuviPanda: oh, cool [19:12:11] don't have time to look at it atm [19:12:18] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL 44.44% of data above the critical threshold [0.0] [19:12:19] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:12:26] valhallasw: cool. do you think you'll have time anytime this week? 
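A worked example of the -l release=trusty answer given at 18:50 above; the job name and script path are hypothetical:
```
# Submit a grid job pinned to a Trusty exec node instead of the Precise default
jsub -l release=trusty -N my-bot python ~/mybot/bot.py
```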
[19:12:43] not sure, but I can probably give it a glance-over [19:13:02] valhallasw: yeah, I will try to nitpick to your quality before merging, but there are some arch changes I'd love to have a look over [19:13:52] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 50.00% of data above the critical threshold [0.0] [19:14:00] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 40.00% of data above the critical threshold [0.0] [19:22:33] RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0] [19:24:41] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0] [19:24:45] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0] [19:25:09] RECOVERY - Puppet failure on tools-redis-02 is OK Less than 1.00% above the threshold [0.0] [19:25:11] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [19:26:19] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0] [19:27:11] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0] [19:28:07] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [19:28:25] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0] [19:29:14] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0] [19:29:34] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0] [19:29:46] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0] [19:30:10] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [19:32:42] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0] [19:32:52] has anyone got wikidata toolkit running on tools-labs, or any other remote server? [19:32:52] i would like to do this so that downloading the data dump is fast, or perhaps ever take from local storage? [19:33:06] RECOVERY - Puppet failure on tools-exec-1403 is OK Less than 1.00% above the threshold [0.0] [19:33:38] RECOVERY - Puppet failure on tools-exec-1402 is OK Less than 1.00% above the threshold [0.0] [19:33:49] RECOVERY - Puppet failure on tools-exec-1208 is OK Less than 1.00% above the threshold [0.0] [19:33:58] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0] [19:34:00] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0] [19:34:05] notconfusing: emailing labs-l is probably going to get a better answer. 
there's also wdq.wmflabs.org [19:34:12] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0] [19:34:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0] [19:35:42] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK Less than 1.00% above the threshold [0.0] [19:35:47] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0] [19:35:51] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0] [19:35:51] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0] [19:35:53] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0] [19:35:59] RECOVERY - Puppet failure on tools-services-01 is OK Less than 1.00% above the threshold [0.0] [19:36:21] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0] [19:36:33] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0] [19:36:39] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0] [19:36:45] RECOVERY - Puppet failure on tools-bastion-02 is OK Less than 1.00% above the threshold [0.0] [19:37:09] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [19:37:19] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0] [19:37:23] RECOVERY - Puppet failure on tools-exec-cyberbot is OK Less than 1.00% above the threshold [0.0] [19:37:35] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0] [19:38:03] RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0] [19:38:53] RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0] [19:39:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [19:39:47] YuviPanda, that's a good idea [19:39:49] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0] [19:40:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0] [19:41:38] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0] [19:41:58] RECOVERY - Puppet failure on tools-checker-01 is OK Less than 1.00% above the threshold [0.0] [19:42:00] RECOVERY - Puppet failure on tools-redis-01 is OK Less than 1.00% above the threshold [0.0] [19:42:16] RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0] [19:43:58] RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0] [19:55:07] are the Wikidata json dumps available from tools-labs, it seems that there are only the xml ones? [19:55:07] well those are in the dumps/public/wikidatawiki folder, but I see that their public URL is http://dumps.wikimedia.org/other/wikidata/ [19:57:01] notconfusing: addshore was looking into them earlier, I think they're available somewhere atm not in the usual place and he's working on making them available from the usual place [19:57:23] notconfusing: https://phabricator.wikimedia.org/T100885 [19:58:27] YuviPanda, you just have all the answers! 
:) [19:58:31] Fantastic [20:11:55] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 60.00% of data above the critical threshold [0.0] [20:17:25] YESSS http://tools.wmflabs.org/lolrrit-wm/ :) [20:17:45] (I kille it, but still) [20:22:37] where's the tool i remember from the past that showed me global edit count of a user across all wikis [20:23:37] mutante: https://tools.wmflabs.org/guc/ ? [20:26:22] valhallasw: yes, thank you! [20:29:28] works, just odd that it claims it found all the edits in "1" project [20:34:14] YuviPanda: No moar failures that I can see. [20:34:31] YuviPanda: But also, not an unheard of issue: https://bugzilla.mozilla.org/show_bug.cgi?id=214625 [20:35:30] There's definitely a bug in the libc resolver, but it's little known because who *has* kilobyte-long lines in their hosts file? :-) [20:36:51] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0] [20:44:43] Coren: I haven't been able to access beta labs from the office all day. JonR says he can get to it from home though. Any idea what might be wrong? [20:45:07] kaldari: Hm, that seems odd. What host are you trying to reach exactly? [20:45:18] I can reach it okay, but then again I'm also home. [20:45:29] en.wikipedia.beta.wmflabs.org which resolves to 208.80.155.135 [20:45:48] or en.m.wikipedia.beta.wmflabs.org [20:46:45] Yeah, wfm here and from my colo server in ohio [20:46:59] I think you need to talk to OIT [20:47:36] Coren: thanks for checking. I'll ask OIT [20:47:38] Have you checked if others in the office have the issue? [20:55:43] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 20.00% of data above the critical threshold [0.0] [20:56:55] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 30.00% of data above the critical threshold [0.0] [20:59:05] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 44.44% of data above the critical threshold [0.0] [20:59:21] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:00:13] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 22.22% of data above the critical threshold [0.0] [21:00:33] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 40.00% of data above the critical threshold [0.0] [21:00:39] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:00:45] PROBLEM - Puppet failure on tools-trusty is CRITICAL 60.00% of data above the critical threshold [0.0] [21:01:10] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:01:10] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:02:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:03:14] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:03:42] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:04:06] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:04:38] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:04:58] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 20.00% of data above the critical threshold [0.0] [21:06:45] 
PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:06:50] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:06:51] PROBLEM - Puppet failure on tools-mail is CRITICAL 40.00% of data above the critical threshold [0.0] [21:07:00] PROBLEM - Puppet failure on tools-services-01 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:07:23] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 33.33% of data above the critical threshold [0.0] [21:07:39] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:07:43] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:07:53] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:08:07] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 44.44% of data above the critical threshold [0.0] [21:08:23] PROBLEM - Puppet failure on tools-shadow is CRITICAL 50.00% of data above the critical threshold [0.0] [21:08:37] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 40.00% of data above the critical threshold [0.0] [21:09:45] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:10:01] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:10:13] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:10:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:12:40] uhh, is this an ldap thing that's happening? [21:12:49] or a network thing [21:13:11] I guess, what I'm saying is, I can't hit ldap in deployment-prep [21:13:19] thcipriani: I’ll look [21:13:29] YuviPanda: and Coren: trained me earlier to ignore those alerts today [21:13:56] kaldari: any difference between using guest wifi, other wifi or cabled network? because i think there is more than 1 provider for office and some goes via ULSFO and some does not [21:14:22] I'll try the others... [21:17:29] thcipriani: I have no issues reaching ldap. What exactly are you trying? [21:18:29] Coren: just doing some manual puppet runs on deployment-prep instances [21:18:40] thcipriani, kaldari i am getting an ldap error trying to use `become` from the office guest wifi [21:19:00] sudo: ldap_start_tls_s(): Connect error [21:19:01] sudo: a password is required [21:19:07] also: /usr/local/bin/ldap-yaml-enc.py deployment-salt.eqiad.wmflabs is returning Unable to connect to LDAP host: ldap://ldap-eqiad.wikimedia.org:389 ldap://ldap-codfw.wikimedia.org:389 [21:19:07] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [21:19:08] Ah. [21:19:20] andrewbogott: I can see it now. LDAP seems actually down. [21:19:42] Coren: yes, it was refusing connections from more and more instances. I’m restarting, we’ll see what happens [21:19:52] of course, restarting ldap causes public dns outage, always [21:19:58] So I will restart that too! [21:20:04] * andrewbogott can’t wait to rip out that dns server [21:20:16] mutante, Coren: The network doesn't matter, but the browser does. In Firefox http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page tries to redirect to the https site and fails. 
In Safari http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page loads fine, but https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page fails. I don't have any special Firefox extensions [21:20:16] installed. [21:21:47] kaldari: i can confirm the https site fails, that is a known issue, but i do not get the redirect from http->https (Iceweasel/Firefox) [21:22:57] Coren: /someone/ is able to talk to opendj because the log is whirring [21:23:21] do you guys know why https was disabled completely ? [21:23:26] as opposed to the cert error [21:23:31] that people could skip over [21:23:46] i saw it on phab somewhere alreayd the other day [21:24:23] kaldari: the https part is https://phabricator.wikimedia.org/T70387 [21:24:27] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL 100.00% of data above the critical threshold [0.0] [21:25:02] kaldari: the redirect is maybe cached somewhere [21:25:40] no wildcard cert for that: https://phabricator.wikimedia.org/T50501 [21:25:42] it's not httpseverywhere? [21:25:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0] [21:25:45] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0] [21:26:10] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [21:26:10] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [21:26:16] sitic: yes, but that task exists longer than the other one. there was a cert error but later it stopped listening altogether [21:26:58] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0] [21:27:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1410 is CRITICAL 37.50% of data above the critical threshold [0.0] [21:27:04] ah ok, never remembered it working [21:27:16] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0] [21:27:26] mutante: hmm, cleared my browser cache and OS DNS cache, but it still tries to redierct :P [21:28:10] oh well, I'll just use another browser for now [21:28:12] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0] [21:28:22] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL 33.33% of data above the critical threshold [0.0] [21:28:24] the puppet failures are due to DNS being down? [21:28:32] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:28:48] I can't look at puppet logs because I can't sudo because can't connect to dns... [21:28:51] kaldari: you can try "about:config" and look for "network.dnsCacheExpiration" [21:28:54] and set it to 0 [21:29:00] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:29:07] andrewbogott: Hah. SSL ldap is broken, cleartext isn't. [21:29:23] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0] [21:29:28] paravoid: I saw a number of SSL changes go by from you earlier; any chance that may have tickled the labs ldap? 
[21:29:37] mutante: ^ same question [21:29:52] PROBLEM - Puppet failure on tools-exec-1406 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:30:00] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:30:03] mutante: no luck with that either [21:30:14] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0] [21:30:32] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0] [21:30:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 44.44% of data above the critical threshold [0.0] [21:30:39] Coren: there’s also 51e0469b94bfb64ad3ca27fde6b3372d8ce5c559 [21:30:44] andrewbogott: eh.. 14:24 < mutante> do you guys know why https was disabled completely ? [21:30:45] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0] [21:30:48] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:31:03] mutante: you’re talking about ldap? [21:31:16] I thought you were discussing beta with kaldari [21:31:18] kaldari: you could add it as a new value http://en.kioskea.net/faq/555-disabling-the-dns-cache-in-mozilla-firefox [21:31:45] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:32:21] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0] [21:32:33] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:32:35] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL 22.22% of data above the critical threshold [0.0] [21:32:35] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:32:37] andrewbogott: yes, that was about beta. i don't know about LDAP changes [21:32:43] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0] [21:32:43] PROBLEM - Puppet failure on tools-exec-1213 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:32:45] RECOVERY - Puppet failure on tools-bastion-02 is OK Less than 1.00% above the threshold [0.0] [21:32:57] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:33:01] PROBLEM - Puppet failure on tools-redis-01 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:33:01] PROBLEM - Puppet failure on tools-exec-1214 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:33:07] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [21:33:19] PROBLEM - Puppet failure on tools-exec-1211 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:33:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0] [21:33:23] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0] [21:33:38] is there a certificate error in the puppet run? [21:34:03] mutante: I’m checking that now [21:34:09] PROBLEM - Puppet failure on tools-submit is CRITICAL 22.22% of data above the critical threshold [0.0] [21:34:23] mutante: is 208.80.155.135 the correct IP it should resolve to? 
[21:34:23] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:34:25] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:34:29] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:34:31] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:34:36] mutante: nope [21:34:50] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:34:56] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0] [21:34:57] andrewbogott: I'm definitely unable to get anything through SSl from ldap [21:35:00] andrewbogott: any error when trying to re-start opendj? [21:35:12] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0] [21:35:12] is it still opendj? [21:35:20] mutante: Sadliy. [21:35:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0] [21:35:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:35:44] Coren: https://dpaste.de/8uyz [21:35:45] kaldari: yea, i get the same IP when resolving from over here [21:35:58] PROBLEM - Puppet failure on tools-exec-1219 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:36:08] PROBLEM - Puppet failure on tools-redis-02 is CRITICAL 22.22% of data above the critical threshold [0.0] [21:36:15] andrewbogott: maybe a previous puppet run re-created the ssl cert already [21:36:20] mutante: probably [21:36:22] mutante: The response I get from that IP is http://pastebin.com/PJvLybjc, which has the Location header pointing to HTTPS [21:36:38] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL 40.00% of data above the critical threshold [0.0] [21:36:39] andrewbogott: which host is it on? 
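For the pdns bind failure logged and diagnosed just below (21:38-21:39, "Cannot assign requested address"), two quick checks that map to the hypotheses raised there (the listen address not being configured on the host, or something else already holding the port), as a sketch:
```
# Is the address pdns is trying to bind actually configured on this host?
ip addr show | grep -F '208.80.154.19'

# Is another process already listening on UDP port 53?
sudo ss -lunp | grep ':53 '
```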
[21:36:42] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:36:46] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:36:46] PROBLEM - Puppet failure on tools-trusty is CRITICAL 20.00% of data above the critical threshold [0.0] [21:36:50] neptunium and nembus [21:36:52] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0] [21:37:08] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL 33.33% of data above the critical threshold [0.0] [21:37:10] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 33.33% of data above the critical threshold [0.0] [21:37:11] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:37:22] PROBLEM - Puppet failure on tools-static-02 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:37:26] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:37:35] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL 70.00% of data above the critical threshold [0.0] [21:37:45] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:37:55] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0] [21:37:57] PROBLEM - Puppet failure on tools-mailrelay-01 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:37:59] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:38:01] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:38:05] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:38:35] Jun 3 21:38:26 neptunium pdns[26165]: Fatal error: Unable to bind to UDP socket [21:38:39] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:38:57] andrewbogott: coren: how about that one up there, not SSL but pdns [21:39:07] binding UDP socket to '208.80.154.19' port 53: Cannot assign requested address [21:39:17] mutante: Huh. [21:39:27] RECOVERY - Puppet failure on tools-webproxy-01 is OK Less than 1.00% above the threshold [0.0] [21:39:33] mutante: isn’t that just pdns complaining that ldap is down, because I restarted it? [21:39:57] pdns keeps trying to bind a socket but can't [21:40:07] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:40:11] respawning every couple seconds [21:40:23] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL 22.22% of data above the critical threshold [0.0] [21:40:27] mutante: pdns is running /on/ neptunium? [21:40:29] andrewbogott: any SSL cert errors on opendj restart? [21:40:33] Hm… I can’t remember if that’s right or wrong. [21:40:38] andrewbogott: it is trying to [21:40:41] PROBLEM - Puppet failure on tools-services-02 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:40:52] mutante: no. 
See link above for output from an opendj restart [21:41:13] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL 33.33% of data above the critical threshold [0.0] [21:41:31] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 50.00% of data above the critical threshold [0.0] [21:41:52] mutante: Good catch. I'm guessing there is a moribund pdns holding the port. [21:42:09] yikes [21:42:11] so, it is trying [21:42:14] binding UDP socket to '208.80.154.19' port 53: [21:42:23] but that IP is not on neptunium [21:42:25] ok, I’m killing pdns and restarting opendj [21:42:31] on neptunium [21:42:32] so restarting doesnt help [21:42:43] can someone give me an example of what's failing? [21:43:07] the messages in syslog _just_ stopped now [21:43:09] before it was: [21:43:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:43:20] mutante: that’s me, killing it. [21:43:21] Jun 3 21:41:43 neptunium pdns[26526]: binding UDP socket to '208.80.154.19' port 53: Cannot assign requested address [21:43:24] Jun 3 21:41:43 neptunium pdns[26526]: Fatal error: Unable to bind to UDP socket [21:43:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:43:44] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL 20.00% of data above the critical threshold [0.0] [21:43:52] mutante: Indeed, that IP isn't on that box. [21:44:04] wonders how restarting it changed things [21:44:14] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 70.00% of data above the critical threshold [0.0] [21:44:20] PROBLEM - Puppet failure on tools-shadow is CRITICAL 30.00% of data above the critical threshold [0.0] [21:44:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 22.22% of data above the critical threshold [0.0] [21:44:25] That's virt1000 [21:44:31] I’m restarting it [21:44:32] or trying [21:44:52] paravoid: I don’t have any more specific info other than ‘nothing in labs can connect to ldap’. [21:45:00] And, dammit, now I can’t even start opendj. [21:45:05] * andrewbogott fixes things by breaking them worse [21:45:58] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 50.00% of data above the critical threshold [0.0] [21:46:14] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL 44.44% of data above the critical threshold [0.0] [21:46:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:46:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 30.00% of data above the critical threshold [0.0] [21:46:35] paravoid: Unable to connect to LDAP host: ldap://ldap-eqiad.wikimedia.org:389 ldap://ldap-codfw.wikimedia.org:389 [21:46:50] got it [21:46:52] fixing [21:47:02] ok. Whatever I broke is now unbroken; opendj is running and responding to cleartext traffic just fine. 
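For reference, a minimal sketch of how the pdns bind failure above could be narrowed down. These are generic Linux commands shown for illustration, not necessarily what was actually run on neptunium; the IP address is the one quoted in the log.

```
# Sketch only: checks for the "Unable to bind to UDP socket" error above (run as root).
# "Cannot assign requested address" usually means the IP is not configured locally;
# "Address already in use" would instead point at a stale process holding the port.

# Is anything already listening on UDP/53, and which process owns it?
ss -lunp 'sport = :53'

# Is 208.80.154.19 actually configured on any interface of this host?
ip -o addr show | grep -F '208.80.154.19' || echo "IP not present on this host"

# If a leftover pdns is holding the socket, find it explicitly before restarting.
pgrep -a pdns
```

The "Cannot assign requested address" wording matches the observation above that 208.80.154.19 belongs to virt1000 rather than neptunium, so a restart alone cannot succeed there.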
[21:47:03] not server related, do not mess with opendj [21:47:06] Or so I judge from the log [21:47:10] ok, I will stop messing :) [21:47:34] PROBLEM - Puppet failure on tools-master is CRITICAL 20.00% of data above the critical threshold [0.0] [21:47:52] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:48:20] PROBLEM - Puppet failure on tools-exec-1216 is CRITICAL 55.56% of data above the critical threshold [0.0] [21:48:24] ffs [21:48:55] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:49:07] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:50:27] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL 60.00% of data above the critical threshold [0.0] [21:53:29] kaldari: i still don't get the https redirect like you do, also tried with curl and chrome. sure it's not httpseverywhere? [21:54:18] puppet fixed, salt running it everywhere [21:54:25] this is pretty broken in general [21:55:03] paravoid: What was the issue? [21:55:17] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL 22.22% of data above the critical threshold [0.0] [21:55:23] ldap.conf was pointing to /etc/ssl/certs/GlobalSign_CA.pem as the CA [21:55:27] which is pretty broken [21:55:35] that's a single intermediate CA, not a certificate store [21:56:04] also, OpenDJ seems to serve just its certificate, not a chain up to a root [21:56:07] also quite broken [21:56:09] mutante: Yeah, works fine for me in Chrome too, no idea why Firefox is screwed up. I don't have httpseverywhere or anything special installed as far as I can tell. Oh well :P [21:56:25] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL 40.00% of data above the critical threshold [0.0] [21:56:52] paravoid: coren: confirmed it works again on neptunium [21:56:55] paravoid: ldap looks to be working now — thanks for your quick response. [21:57:04] Should I open a task to sort out the cert situation? [21:57:12] I see everything working from instance-level as well. [21:57:15] PROBLEM - Puppet failure on tools-exec-1218 is CRITICAL 66.67% of data above the critical threshold [0.0] [21:57:27] modules/labs_bootstrapvz/files/labs-jessie.manifest.yaml also has a GlobalSign_CA reference, so it will fail [21:57:51] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL 50.00% of data above the critical threshold [0.0] [21:57:58] I need to restart pdns now, of course [21:58:01] like clockwork [21:58:04] there is no easy way to fix this, the easiest one would be to fix OpenDJ to serve the full chain [21:58:32] kaldari: look at your cookies. There is some cookie that gets set for some people some times that tries to force https to beta cluster. Once it's there you have to delete it manually or the frontend nginx/varnish will keep redirecting you to the non-existent https endpoint [21:58:37] or switch to a different ldap implementation [21:58:47] or that, that's a longer project :) [21:58:58] the puppet classes for ldap are also quite the mess [22:00:01] bd808: YES! that fixed it [22:01:58] we had lots of reports of that problem >1 year ago when we moved to eqiad [22:02:18] because in pmtpa https to beta cluster actually worked [22:03:18] paravoid: when you say ‘fix OpenDJ to serve the full chain’ do you mean fix opendj source, or is it just a config issue?
[22:03:26] config issue :) [22:04:07] I seem to have missed this storm [22:04:13] (Saw the catch point alerts) [22:04:17] paravoid: ok, that’s not terrible then. [22:04:59] andrewbogott: Coren did you also get the catch point alerts? [22:05:07] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [22:05:46] yes, although I didn’t pay much attention since I was already into it [22:05:54] YuviPanda: I did; though I was sitting on a box when things started to melt so I didn't need to be told. :-) [22:06:24] andrewbogott: Coren yeah cool :) just wanted to check if the addition done earlier worked :) [22:06:46] RECOVERY - Puppet failure on tools-trusty is OK Less than 1.00% above the threshold [0.0] [22:06:48] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK Less than 1.00% above the threshold [0.0] [22:07:10] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [22:08:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0] [22:08:36] RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0] [22:09:16] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0] [22:10:24] RECOVERY - Puppet failure on tools-exec-1201 is OK Less than 1.00% above the threshold [0.0] [22:10:40] RECOVERY - Puppet failure on tools-services-02 is OK Less than 1.00% above the threshold [0.0] [22:11:16] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK Less than 1.00% above the threshold [0.0] [22:11:32] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0] [22:11:46] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0] [22:12:44] coren, yuvi, I’m about to go out for dinner. I apologize in advance for whatever misfortune results :( [22:12:57] andrewbogott: it's ok, you can buy me alcohol next time we meet :) [22:13:02] * ArchaeologistPan accepts andrewbogott's apologies [22:13:08] ArchaeologistPan: Oh, there you are :) [22:13:13] Heh. Go ahead. Sometimes things don't break when you are away. [22:13:15] :-) [22:13:18] andrewbogott: :) I found a floppy disk! [22:13:19] anyway [22:13:22] RECOVERY - Puppet failure on tools-exec-1216 is OK Less than 1.00% above the threshold [0.0] [22:13:29] Coren: can you email labs-l? a few people have opened bugs :) [22:13:36] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0] [22:13:36] * ArchaeologistPan disappears again, mostly [22:13:44] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK Less than 1.00% above the threshold [0.0] [22:13:44] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0] [22:13:46] RECOVERY - Puppet failure on tools-bastion-02 is OK Less than 1.00% above the threshold [0.0] [22:13:53] the good news is, the switch to new DNS went OK today; beta and staging are migrated now and seem OK. [22:14:01] ArchaeologistPan: Yeah, I have two email incoming. I have to grab a bite first. 
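To make paravoid's diagnosis above concrete, here is a minimal sketch of testing StartTLS against the LDAP servers named in the channel with a full CA bundle instead of a single intermediate. The bundle path assumes a Debian/Ubuntu layout and is an assumption, not taken from the incident.

```
# Sketch only: StartTLS check against the servers mentioned above.
# The old ldap.conf pointed TLS_CACERT at a lone intermediate (GlobalSign_CA.pem),
# which only works if the server sends its full chain; the system bundle is safer:
#   /etc/ldap/ldap.conf:
#     TLS_CACERT /etc/ssl/certs/ca-certificates.crt

# Require StartTLS (-ZZ) and do an anonymous rootDSE query as a connectivity test:
LDAPTLS_CACERT=/etc/ssl/certs/ca-certificates.crt \
  ldapsearch -ZZ -x -H ldap://ldap-eqiad.wikimedia.org -b '' -s base '(objectclass=*)' namingContexts

# (Newer openssl releases can also dump the chain the server actually serves:
#  openssl s_client -starttls ldap -connect ldap-eqiad.wikimedia.org:389 -showcerts)
```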
[22:14:06] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [22:14:06] RECOVERY - Puppet failure on tools-exec-1403 is OK Less than 1.00% above the threshold [0.0] [22:14:22] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0] [22:14:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0] [22:14:36] RECOVERY - Puppet failure on tools-exec-1402 is OK Less than 1.00% above the threshold [0.0] [22:14:46] RECOVERY - Puppet failure on tools-exec-1208 is OK Less than 1.00% above the threshold [0.0] [22:15:00] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0] [22:15:18] ok [22:15:26] RECOVERY - Puppet failure on tools-webproxy-01 is OK Less than 1.00% above the threshold [0.0] [22:15:59] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0] [22:16:12] RECOVERY - Puppet failure on tools-exec-1401 is OK Less than 1.00% above the threshold [0.0] [22:16:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0] [22:16:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0] [22:16:44] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK Less than 1.00% above the threshold [0.0] [22:16:54] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0] [22:17:02] RECOVERY - Puppet failure on tools-services-01 is OK Less than 1.00% above the threshold [0.0] [22:17:04] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1410 is OK Less than 1.00% above the threshold [0.0] [22:17:32] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0] [22:17:53] RECOVERY - Puppet failure on tools-exec-1405 is OK Less than 1.00% above the threshold [0.0] [22:18:19] RECOVERY - Puppet failure on tools-exec-cyberbot is OK Less than 1.00% above the threshold [0.0] [22:18:21] RECOVERY - Puppet failure on tools-exec-1211 is OK Less than 1.00% above the threshold [0.0] [22:18:53] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0] [22:18:59] RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0] [22:19:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK Less than 1.00% above the threshold [0.0] [22:19:51] RECOVERY - Puppet failure on tools-exec-1406 is OK Less than 1.00% above the threshold [0.0] [22:19:59] RECOVERY - Puppet failure on tools-exec-1209 is OK Less than 1.00% above the threshold [0.0] [22:20:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [22:20:49] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK Less than 1.00% above the threshold [0.0] [22:22:16] RECOVERY - Puppet failure on tools-exec-1218 is OK Less than 1.00% above the threshold [0.0] [22:22:22] RECOVERY - Puppet failure on tools-static-02 is OK Less than 1.00% above the threshold [0.0] [22:22:28] RECOVERY - Puppet failure on tools-static-01 is OK Less than 1.00% above the threshold [0.0] [22:22:32] RECOVERY - Puppet failure on tools-exec-1215 is OK Less than 1.00% above the threshold [0.0] [22:22:34] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0] [22:22:36] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% 
above the threshold [0.0] [22:22:36] RECOVERY - Puppet failure on tools-exec-1409 is OK Less than 1.00% above the threshold [0.0] [22:22:44] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [22:22:56] RECOVERY - Puppet failure on tools-mailrelay-01 is OK Less than 1.00% above the threshold [0.0] [22:22:58] RECOVERY - Puppet failure on tools-checker-01 is OK Less than 1.00% above the threshold [0.0] [22:23:01] RECOVERY - Puppet failure on tools-exec-1214 is OK Less than 1.00% above the threshold [0.0] [22:23:03] RECOVERY - Puppet failure on tools-redis-01 is OK Less than 1.00% above the threshold [0.0] [22:23:05] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK Less than 1.00% above the threshold [0.0] [22:23:05] RECOVERY - Puppet failure on tools-exec-1217 is OK Less than 1.00% above the threshold [0.0] [22:24:27] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0] [22:24:28] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0] [22:24:32] RECOVERY - Puppet failure on tools-webproxy-02 is OK Less than 1.00% above the threshold [0.0] [22:24:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [22:25:15] RECOVERY - Puppet failure on tools-exec-1205 is OK Less than 1.00% above the threshold [0.0] [22:25:34] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [22:25:58] RECOVERY - Puppet failure on tools-exec-1219 is OK Less than 1.00% above the threshold [0.0] [22:26:08] RECOVERY - Puppet failure on tools-redis-02 is OK Less than 1.00% above the threshold [0.0] [22:26:26] RECOVERY - Puppet failure on tools-exec-1206 is OK Less than 1.00% above the threshold [0.0] [22:26:38] RECOVERY - Puppet failure on tools-exec-gift is OK Less than 1.00% above the threshold [0.0] [22:26:42] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK Less than 1.00% above the threshold [0.0] [22:27:08] RECOVERY - Puppet failure on tools-exec-1408 is OK Less than 1.00% above the threshold [0.0] [22:27:10] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [22:27:50] RECOVERY - Puppet failure on tools-exec-1203 is OK Less than 1.00% above the threshold [0.0] [22:27:57] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0] [22:28:35] RECOVERY - Puppet failure on tools-exec-1404 is OK Less than 1.00% above the threshold [0.0] [22:29:21] shinken-wm: sorry, gotta ignore you now [22:36:55] hmm. sudo works on my labs node, but also prints an error: [22:36:57] $ sudo echo test [22:36:57] sudo: ldap_start_tls_s(): Connect error [22:36:57] test [22:39:02] hmm i restarted nscd and it got worse :( [22:39:03] $ sudo echo test [22:39:03] sudo: unknown uid 4177: who are you? 
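A minimal sketch of the usual NSS/nscd checks for the "unknown uid" error above, run on the affected instance. The commands are standard tooling shown for illustration; the uid and username are the ones quoted in the log.

```
# Sketch only: can NSS (via the LDAP backend) resolve the numeric uid and the user?
getent passwd 4177
id gage

# Flush nscd's cached passwd/group tables, or restart the daemon outright.
nscd --invalidate=passwd
nscd --invalidate=group
service nscd restart
```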
[22:40:02] [terbium:~] $ ldaplist -l passwd gage | grep uidNumber uidNumber: 4177 [22:40:09] terbium knows you [22:41:16] jgage: that's like the LDAP issue that paravoid fixed [22:41:22] but works for me [22:41:49] i just logged into an instance i hadn't touched since before the problem, and i get the same ldap_start_tls_s() error [22:42:41] RECOVERY - Puppet failure on tools-exec-1213 is OK Less than 1.00% above the threshold [0.0] [22:42:51] i wonder if i dare checking whether a reboot fixes this [22:42:55] i don't want to get locked out [22:44:09] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0] [22:44:51] could not resolve hostname bastion-restricted-eqiad.wmflabs.org: [22:45:10] does this use an up-to-date puppet? [22:45:15] or is it a ::self instance? [22:45:19] ::self [22:45:24] well there's your problem :) [22:45:27] heh [22:45:34] the recent email said the changes would happen next monday :\ [22:45:43] I salt-fixed this, but puppet undid the fix [22:45:47] rebase off master [22:45:49] k [22:46:47] mutante: yeah that dns lookup fails for me too :( [22:47:16] yea, so can't connect to my instance either but other error [22:47:21] hm it works from public [22:48:09] not for me right now [22:48:11] i get the right answer from labs-ns1 but not from ns0 [22:49:01] jgage: confirmed that [22:49:07] ns1 yes, ns0 no [22:53:34] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL 40.00% of data above the critical threshold [0.0] [22:53:34] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL 22.22% of data above the critical threshold [0.0] [22:55:48] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 50.00% of data above the critical threshold [0.0] [22:55:56] andrewbogott_afk: T101317 is fixed [22:56:42] wikibugs appears to be silent, and in need of a gentle percussive maintenance. (no activity in -dev for a few hours) [23:15:17] "Things are down?" = {tools-login,bastion}.wmflabs.org does not resolve? [23:15:39] scfc_de: yes, ns0 fails, ns1 does not [23:16:02] this isnt just about restarting wikibugs [23:22:59] mutante: Is there someone working on that or a Phabricator task? [23:23:34] RECOVERY - Puppet failure on tools-exec-1410 is OK Less than 1.00% above the threshold [0.0] [23:23:34] RECOVERY - Puppet failure on tools-exec-1409 is OK Less than 1.00% above the threshold [0.0] [23:24:08] scfc_de: no [23:25:48] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [23:30:30] Hm. ns0 having issues then? [23:35:28] scfc_de: I have no issues resolving from here. [23:36:04] Coren: i do [23:36:14] but only ns0 [23:36:18] and pdns is running [23:37:47] So I see. [23:39:44] Jun 3 22:24:51 virt1000 pdns[21218]: Database module reported condition which prevented lookup (LDAP server unreachable) sending out servfail [23:40:08] Though that is some time ago. [23:40:11] * Coren digs [23:40:44] So the hiccup of the LDAP server earlier (?) caused pdns to remove that from its "source list"? [23:41:01] scfc_de: It might. [23:41:17] Jun 3 23:40:56 virt1000 pdns[19125]: [LdapBackend] Ldap connection succeeded [23:41:32] Giving it a swift kick in the diodes seems to have helped. [23:41:50] Ah, and indeed ns0 is now responsive again. [23:41:51] labs-ns0 is resolving again, it seems like. [23:42:31] Hey guys I just finished a Python script I'd like to host for on-demand use on Labs. [23:42:46] And I can log in again. Thanks! 
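A minimal sketch of comparing the two labs nameservers directly, as was done by hand above. The labs-ns0/labs-ns1.wikimedia.org names are assumed to correspond to the ns0/ns1 shorthand used in the channel, and the test hostname is one reported as failing.

```
# Sketch only: query both labs nameservers and compare the answers.
for ns in labs-ns0.wikimedia.org labs-ns1.wikimedia.org; do
  echo "== $ns =="
  dig @"$ns" tools-login.wmflabs.org A +short +time=2 +tries=1
done
# A SERVFAIL or empty answer from ns0 alongside a normal answer from ns1 matches the
# "Database module reported condition which prevented lookup (LDAP server unreachable)"
# error pdns logged on virt1000.
```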
[23:44:33] It's a script that constructs a draft of the featured content report in the Signpost. Scrubs information on pages, nominations, and nominators, saving ~20 minutes of manual work per week. [23:44:40] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 30.00% of data above the critical threshold [0.0] [23:45:08] How difficult would it be to set it up on Labs? Some docs to read on the subject, perhaps? [23:45:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 40.00% of data above the critical threshold [0.0] [23:46:36] ResMar: tools has some issues right now, but start by taking a look at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Quick_start [23:47:01] Uh to be honest [23:47:14] Last I heard there wasn't a time that Labs didn't have an issue. [23:47:28] Thanks. [23:47:46] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 66.67% of data above the critical threshold [0.0] [23:47:50] afaict, there are no issues left atm. [23:47:52] maybe you only hear from it when it has an issue ;-) [23:48:29] That is definitely true in general.
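For anyone following the quick-start link above, a minimal sketch of what hosting such a script on Tool Labs looked like at the time, using the grid engine tooling the documentation describes. The tool name "fcreport" and the script path are hypothetical, chosen only for illustration.

```
# Sketch only: hypothetical tool "fcreport" and script path; see the
# Help:Tool_Labs quick start linked above for the authoritative steps.

# On tools-login, switch to the tool account:
become fcreport

# One-off run on the job grid (output lands in ~/fcreport.out and ~/fcreport.err):
jsub -once -N fcreport python "$HOME/featured_report.py"

# Or schedule it weekly from the tool's crontab (Mondays at 06:00 UTC):
#   crontab -e
#   0 6 * * 1 jsub -once -N fcreport python $HOME/featured_report.py
```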