[01:22:04] Coren: ok, i think I have it sorted… updated instructions at https://wikitech.wikimedia.org/wiki/Ldap_rename
[03:40:25] Wikimedia Labs / deployment-prep (beta): Template:Artwork does not contain templatedata - https://bugzilla.wikimedia.org/71340 (dan) NEW p:Unprio s:normal a:None steps to reproduce ------------------ http://commons.wikimedia.beta.wmflabs.org/w/api.php?action=templatedata&titles=Template:Art...
[06:20:24] Wikimedia Labs / deployment-prep (beta): wikidata beta (item pages, etc.) inaccessible with 503 errors - https://bugzilla.wikimedia.org/69708 (Ori Livneh) NEW>RESO/WOR
[06:22:38] Wikimedia Labs / deployment-prep (beta): HHVM crash logs need to go somewhere more visible than /tmp on the apache hosts - https://bugzilla.wikimedia.org/68459#c2 (Ori Livneh) NEW>RESO/FIX We're sending the logs to syslog now and forwarding them to fluorine and its beta equivalent. The log files...
[06:24:52] Wikimedia Labs / deployment-prep (beta): beta: ResourceLoader CSS URL gzipped twice, causing skins to be broken - https://bugzilla.wikimedia.org/68720#c3 (Ori Livneh) NEW>RESO/FIX Fixed in 4bdfea04cea2e54e6076a1c6b955ba1e1b51209f.
[06:34:07] !log deployment-prep updated OCG to version f3a6c1cbba118d4a5e1aa019937dc50159fc823d
[06:34:11] Logged the message, Master
[08:17:52] Tool Labs tools / [other]: hikebikemap utf8 miscoding - https://bugzilla.wikimedia.org/71173#c8 (Peter Gervai (grin)) NEW>RESO/WOR (In reply to kakrueger from comment #7) > I am not exactly sure what happened to cause this issue in the first place, > but I am hoping it should be fixed now. That...
[10:29:07] Wikimedia Labs / Infrastructure: CatScan doesn't load (times out) - https://bugzilla.wikimedia.org/71336#c1 (Andre Klapper) NEW>UNCO Loading https://tools.wmflabs.org/catscan2/catscan2.php works for me - I get the CatScan V2.0β page with lots of nice fields, and it also shows results.
[10:33:09] Wikimedia Labs / Infrastructure: CatScan doesn't load (times out) - https://bugzilla.wikimedia.org/71336#c2 (Nemo) UNCO>RESO/FIX Yep, seems it was fixed at last. Thanks to the anonymous lifesaver!
[10:51:26] wikitech admin: https://wikitech.wikimedia.org/wiki/Special:Contributions/Iwffxjdmor spam
[11:54:38] Wikimedia Labs / tools: deletion queries joined with tokudb replication tables are really slow - https://bugzilla.wikimedia.org/68918#c2 (Silke Meyer (WMDE)) Hey Sean, what's the status on this? For Merl's migration from Toolserver this is really important. Without working delete queries, he'd have an...
[12:04:41] I will let you know when I see andrewbogott around here
[12:04:41] @notify andrewbogott
[13:44:11] YuviPanda: Do you have any projects with that use a local puppetmaster?
[13:44:19] hmm
[13:44:23] I do, actually
[13:44:29] quarry has one machine on a local puppetmaster
[13:44:32] shinken also local puppetmaster
[13:44:51] bunch of old instances in graphite project that are local puppetmaster...
[13:44:56] So, shinken project has a project-wide puppetmaster?
[13:45:10] andrewbogott: nope, I just use one machine there, and it is self puppet
[13:45:15] I'm looking for an example that has a single role::puppet::self instance with more than 0 clients.
[13:45:16] ah, no projectwide puppetmasters for me
[13:45:27] OK, then beta will have to be the beta :)
[13:45:27] no, none of mine have that
[13:45:35] andrewbogott: integration also has its own puppetmaster
[13:45:38] Or I guess I can just apply this to labs and hope it works
[13:45:50] andrewbogott: betalabs sounds better start :D
[13:45:58] andrewbogott: and hopefully I can grab you after for icinga mertes
[13:45:59] *merges
[13:46:03] hashar, around?
[13:46:13] YuviPanda: maybe. I'm going to be pretty tense about ldap for a while yet
[13:46:23] andrewbogott: ah, ok.
[13:46:34] * YuviPanda is on a flight for a few days for the next week, so was hoping to get these merged today
[13:46:45] Hm, who else mucks with the beta puppetmaster...
[13:47:52] andrewbogott: yes
[13:48:06] hashar: have you read my IMPORTANT: ldap email on labs-l?
[13:48:35] * hashar andrewbogott: not yet, doing so right now :]
[13:48:56] (( If you work in a non-tools project and you have a local puppetmaster ))
[13:48:57] thanks. I'd like to merge those patches on integration before I do so on all of labs. Just as one last test to be absolutely sure...
[13:48:57] damn
[13:49:10] that mails apply to beta cluster and integration labs projects :D
[13:49:10] andrewbogott: just curious, would root keys still work?
[13:49:14] backpeddling in one project won't be as terrifying as backpeddling all of labs, if something goes horribly wrong.
[13:49:19] YuviPanda: yeah, they will.
[13:49:22] ah cool
[13:49:29] So it won't break labs forever, it'll just break my weekend
[13:50:29] hashar, I'd like you to pick one of those two projects (depending on which outage will be less disruptive) and merge those patches on the puppetmaster.
[13:50:55] andrewbogott: is that wmfusercontent.org cert related to ldap? https://gerrit.wikimedia.org/r/#/c/159740/1
[13:51:31] hashar: yes. That cert is needed to talk securely to the new ldap server
[13:51:45] so the new ldap is under wmfusercontent.org domain right?
[13:51:45] that's the first of the two patches I mention, right?
[13:51:49] yeah
[13:51:53] which has been merged
[13:51:59] so I just need to rebase the puppet repos
[13:52:06] hashar: no, it's the intermediate cert that's required.
[13:52:21] OHHH
[13:52:24] the wmfusercontent thing will just ride along because it's in the same patch. Shouldn't matter.
[13:53:01] !log integration rebasing puppetmaster
[13:53:05] Logged the message, Master
[13:53:21] hashar: because I'm using integration as a test case, you'll need to cherry-pick the second patch
[13:53:26] as it isn't merged yet.
[13:53:36] If all goes well I'll merge it, and the a fetch/rebase will be enough for all subsequent caes.
[13:53:39] cases.
[13:55:53] andrewbogott: both deployment-prep and integration have the new intermediary CA cert now :]
[13:55:58] well at least their puppet master
[13:56:26] hashar: ok, the next step is to cherry-pick https://gerrit.wikimedia.org/r/#/c/162689/ onto one of the puppet masters...
[13:56:34] you could review that before you cherry-pick if you want :)
[13:56:42] yeah doing so
[13:56:43] YuviPanda, I wouldn't mind a review from you as well
[13:57:10] hashar: there's some cruft in there because it's important to get the ordering right. Ldap has to be fully changed before other things happen or puppet breaks and… doom.
[13:58:46] not that deployment-prep has its own salt master :/
[13:59:35] andrewbogott: looking now
[14:01:57] hashar: I thought it did? And, anyway -- is that related to this patch?
[14:02:01] andrewbogott: would it make sense to decouples things a bit ?
[14:02:12] hashar, how so?
[14:02:13] i.e. have the new cert installed , then the new require => stanza
[14:02:22] and finally a third change that does the switch
[14:02:32] but maybe that is too much splitting
[14:02:44] Oh, maybe -- I was trying to keep things in one big patch because that makes for a single cherry-pick, easier instructions for people who are updating by hand.
[14:03:15] I am a huge fan of staged tiny deploys :]
[14:03:29] me too, when I'm doing them.
[14:03:38] hashar: hi.
[14:04:25] hashar: I'd like to write a tool which tetches fresh category members. better as a tool or as an extension? it doesn't need to write to the db.
[14:04:33] it'd fetch recursively.
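[Editor's note] The fetch/rebase plus cherry-pick workflow discussed above for a self-hosted puppetmaster can be sketched with throwaway repositories. Everything below is invented for illustration (the real repo is /var/lib/git/operations/puppet and the real change lives in Gerrit); it only demonstrates the git mechanics of applying a not-yet-merged change locally.

```shell
# Minimal sketch: cherry-pick an unmerged change onto a local clone.
# All repo and commit names here are hypothetical stand-ins.
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in for the upstream operations/puppet repo, with the unmerged
# change on a side branch.
git init -q upstream
cd upstream
git config user.email demo@example.com
git config user.name demo
echo 'base config' > site.pp
git add site.pp && git commit -q -m 'base'
git checkout -q -b unmerged-change
echo 'new ldap config' >> site.pp
git commit -a -q -m 'ldap: point at new servers'
git checkout -q -          # back to the default branch

# Stand-in for the puppetmaster's local clone (which may carry local hacks).
cd "$tmp"
git clone -q upstream puppetmaster
cd puppetmaster
git config user.email demo@example.com
git config user.name demo
git fetch -q origin unmerged-change
git cherry-pick FETCH_HEAD       # apply the unmerged change locally
grep 'new ldap config' site.pp   # confirm it landed
```

Once the change is merged upstream, a plain `git fetch` + `git rebase` replaces the cherry-pick, which matches the "fetch/rebase will be enough for all subsequent cases" remark above.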
[14:04:39] andrewbogott: if you get the a change with the certs + requires deployed. You can then ensure they are present on all instances via a salt command that looks for /etc/ssl/certs/GlobalSign_CA.pem
[14:05:02] andrewbogott: this way you somehow have an idea of which instances are not self updating / use puppetmaster self. Ie a list of potential breaker later on
[14:05:03] That's already divided into two patches, right?
[14:05:19] The patch that installs the cert is already installed everywhere that has puppet running...
[14:05:23] the first patch adds the ssl cert in puppet, but it is not installed / copied on the instances
[14:05:37] at least I can't find it under /etc/
[14:05:44] oh! Hm...
[14:05:56] root@integration-puppetmaster:~# find /etc/ -name GlobalSign_CA.pem
[14:05:56] root@integration-puppetmaster:~#
[14:06:13] so I would extract from https://gerrit.wikimedia.org/r/#/c/162689/10 the bits that get the new cert deployed
[14:06:37] Svetlana: that might already be build in Mediawiki core :]
[14:07:17] hashar: I think you are right, but because I've already emailed instructions I'd prefer not to vary them.
[14:07:33] Unless you feel VERY strongly. I've tested that patch a bunch of times on standalone instances, the ordering is correct.
[14:07:45] Svetlana: what is the use case?
[14:07:58] andrewbogott: sure :-]
[14:08:11] andrewbogott: let me stop puppet on all integration instances and attempt to run it on the puppet master :]
[14:08:19] thx
[14:08:21] hashar: new page patrol.
[14:08:23] assuming you have an hour ahead to unscrew it :D
[14:08:30] yep, a whole day
[14:08:44] hashar: I don't think it's in the core. there is the api for it, but no interface, and the thing it has is not recursive iirc.
[14:08:50] Svetlana: probably better to talk about your idea on wikitech-l . I barely do any mediawiki development nowadays
[14:09:07] Svetlana: + gotta break a labs project :]
[14:09:21] hashar: what channel on irc is good? #mediawiki is deaf to that.
and what break?
[14:09:48] Svetlana: #mediawiki or #wikimedia-dev
[14:10:37] andrewbogott: do you have any magic salt command to disable puppet on all instances of the 'integration' project?
[14:10:48] andrewbogott: else will ssh to each and puppet agent --disable :]
[14:11:00] ok.
[14:11:13] hashar: I think mutante set up a way to do salt commands per-project but I don't know how they work offhand :(
[14:11:23] andrewbogott: will brute force so :]
[14:11:33] I should get dsh on my machine
[14:15:12] integration-zuul-server.eqiad.wmflabs: rcmd: getaddrinfo: nodename nor servname provided, or not known
[14:15:15] holy shit dsh
[14:15:44] hashar: hang on on that cherry-pick, I just found a mistake
[14:18:19] * bmansurov is back (gone 00:01:34)
[14:18:34] * bmansurov is away: I'm busy
[14:19:10] !log integration disabled puppet on all instances
[14:19:13] Logged the message, Master
[14:20:15] andrewbogott: you can use integration-puppetmaster.eqiad.wmflabs :] I have disabled puppet on all the instances
[14:20:19] hashar: will be 10-15 minutes while I verify that the new patch works on a standalone box. Stay tuned...
[14:20:25] sure
[14:21:31] YuviPanda: I added a few comments :)
[14:21:38] yay :)
[14:21:39] * YuviPanda looks agin
[14:21:41] *aign
[14:21:42] gah
[14:21:43] agahin
[14:21:45] wut
[14:21:46] again
[14:22:10] nature's call
[14:38:22] hashar: ok! patchset 11 works for me. Shall I do the cherry-pick or do you want to do it?
[14:38:32] andrewbogott: do it :]
[14:38:43] on integration-puppetmaster
[14:38:48] /var/lib/git/operations/puppet
[14:38:58] it has a few local hacks but they should not conflict
[14:39:21] not puppet agent is disabled there (just like all the other integration instances)
[14:45:22] hashar: can I reboot integration-puppetmaster? (I don't really need to, just being cautious)
[14:45:36] andrewbogott: sure
[14:45:43] andrewbogott: it only serves as a puppet master
[14:49:36] hashar: ok -- now, what node should I test on first?
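[Editor's note] The per-project salt targeting asked about above (and the cert-presence check hashar suggested earlier) might look like the sketch below. This is an assumption, not something shown in the log: the grain name `instanceproject` is a guess at how per-project targeting could be wired up, and these commands only run against a real salt master.

```
# Assumed sketch -- grain name is a guess, not taken from the log.
# Disable puppet on every instance in the 'integration' project:
salt -G 'instanceproject:integration' cmd.run 'puppet agent --disable'

# List which instances actually received the new CA cert, i.e. which
# ones are updating from the central puppetmaster:
salt '*' cmd.run 'ls /etc/ssl/certs/GlobalSign_CA.pem 2>/dev/null || echo MISSING'
```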
The master looks good.
[14:49:51] Coren: ping me when you're up and about?
[14:50:39] andrewbogott: integration-dev :]
[14:50:47] integration-dev.eqiad.wmflabs
[14:53:44] hashar: ok, can you log into that box and confirm that it looks fine to you?
[14:53:52] I'll be back in 5
[15:03:36] months
[15:04:34] andrewbogott: integration-dev seems fine, though it hasn't been rebooted.
[15:04:47] hashar: yeah, I only rebooted the puppetmaster.
[15:04:55] I guess we should try that one too -- mind rebooting and confirming?
[15:14:01] andrewbogott: yup do reboot it :]
[15:27:03] hashar: ok, I've verified that logins fail if I stop both of the new ldap servers, but logins work if either of them is up.
[15:27:10] yay
[15:27:16] So -- all good! Go ahead and reenable puppet on the other nodes.
[15:27:28] I'm going to merge this patch in a moment, and then watch toollabs like a hawk
[15:30:05] andrewbogott: beta cluster might need some work as well
[15:30:24] hashar: yep, it'll need a rebase.
[15:30:43] andrewbogott: it is done more or less automatically via a cron that runs once per hour
[15:30:47] but it choke on conflicts
[15:30:59] hashar: ok, must likely will just work then
[15:34:45] !log integration reenabling puppet agent on all integration instances
[15:35:18] poor logger
[15:36:32] I cannot find my project listed in https://wikitech.wikimedia.org/wiki/Special:NovaInstance
[15:36:48] something broke it ? and no Openstack links in the sidebar too
[15:36:57] andrewbogott: sounds like a job well done :]
[15:36:59] tonythomas: log out and log back in?
[15:37:05] mmmmmaybe
[15:37:06] tonythomas: log out / log back in. known bug :]
[15:37:12] YuviPanda: of course. how could I forget
[15:37:33] YuviPanda: I have reenabled puppet on the integration instances so the alarms should clear up
[15:38:14] oh no.
it says password incorrect :\
[15:38:22] I never remember changing my password though
[15:38:33] hashar: the alarm was for only one host (integration-dev-trusty), and it wasn't a puppet stale alarm (puppet has not run for a while) but a puppet *failure* alarm, so puppet ran but had some failure events.
[15:39:33] tonythomas: I can't login to wikitech either.
[15:39:37] is anyone else having the password incorrect issue ?
[15:39:39] might be ldap issues.
[15:39:47] yeah. looks like
[15:39:50] meh
[15:41:10] YuviPanda: ah dev-trusty is some instance set up by Timo not sure in which state it is really :(
[15:41:33] hashar: can you give me projectadmin on that instance, and I can investigate?
[15:43:13] I just tried tonythomas01@deployment-bastion:~$ mysql loginwiki
[15:43:13] ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
[15:43:34] and get this! but I could get through the same a few days back though
[15:43:47] :~$
[15:44:50] tonythomas: use sql
[15:44:52] sql enwiki
[15:45:01] that gives the proper credential to connect to the proper sql server
[15:45:13] YuviPanda: doing so
[15:45:16] hashar: oh. my
[15:47:10] hashar: tonythomas is on deployment-bastion, no 'sql' ther, I think
[15:47:27] Wikitech logins are broken, I'll investigate in just a moment
[15:48:15] YuviPanda: no. it looks like :~$ sql loginwiki worked.
[15:48:19] ooh, and gerrit is broken too, that's going to make it hard to fix this :)
[15:48:21] oh
[15:48:22] nice
[15:48:41] hashar_: can you use your superpowers to merge a patch for me without the gerrit web interface?
[15:49:02] andrewbogott: I don't have super power on operations/puppet.git :D
[15:49:26] andrewbogott: which change ? :D
[15:49:36] https://gerrit.wikimedia.org/r/#/c/163183/
[15:49:37] for starters
[15:50:06] andrewbogott: I'm sorta back and mostly lucid. Can I help?
[15:50:12] andrewbogott: I can't merge that :-/
[15:50:20] hashar_: can you advise me about how to do it?
[15:50:30] andrewbogott: hooo LDAP is broken
[15:50:32] andrewbogott: :D
[15:50:35] Coren: sure, you can figure out what the deal is with labs-ns1
[15:50:51] I think ops are granted rights via the ldap/ops magic thing
[15:51:02] https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet,access
[15:51:10] ahah
[15:51:18] * Coren reads scrollback.
[15:51:25] andrewbogott: we are screwed I am afraid
[15:51:35] Coren: there's nothing in the scrollback, it's unrelated to the current issue.
[15:51:42] hashar_: Can't I log into gerrit?
[15:51:57] will look whether I can change permissions on operations/puppet
[15:52:44] andrewbogott: Gimme the one-liner summary of the symptom and I'm on it.
[15:53:12] Coren: it may be nothing. All I know is "PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call "
[15:53:22] hashar_: I may be able to just fix ldap instead, give me a moment
[15:54:36] andrewbogott: trying to see whether I have enough credentials to add you to ops/puppet.git
[15:55:36] hashar_: where can I look to see what gerrit's beef is with ldap?
[15:55:48] because ldap /should/ be working now
[15:57:46] andrewbogott: pdns config is hosed? There's a parameter in /etc/powerdns/pdns.conf that isn't understood by the daemon so it refuses to start, and it tries to bind explicitly to an IP the box doesn't have. Looks like the config is for another box.
[15:58:35] Coren: I'd expect pdns to need a restart after ldap things change. But I don't know much about the config otherwise
[16:02:56] andrewbogott: I managed to fix the config locally, but puppet is probably going to trample all over it again. Part of its config was leftover from virt1000 it seems.
[16:04:06] Coren: wait!
[16:04:11] labs-ns1 is supposed to be virt1000
[16:04:20] I didn't move it as far as I know.
[16:04:25] What is it pointing to now?
[16:04:30] neptunium
[16:04:37] Can you tell why?
[16:04:51] There shouldn't even be dns on neptunium
[16:04:55] labs-ns1.wikimedia.org has address 208.80.154.19
[16:05:29] 2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000
[16:05:29] inet 208.80.154.6/26 brd 208.80.154.63 scope global eth0
[16:05:29] inet 208.80.154.19/32 scope global eth0
[16:05:39] Definitely neptunium
[16:05:50] neptunium is 208.80.154.6
[16:06:00] * Coren points up.
[16:06:28] OOooo. Do you have two boxen with the same IP?
[16:07:02] 2: eth0: mtu 1500 qdisc mq state UP qlen 1000
[16:07:02] inet 208.80.154.18/26 brd 208.80.154.63 scope global eth0
[16:07:02] inet 208.80.154.19/32 scope global eth0
[16:07:05] You do.
[16:07:30] Accidental anycast!
[16:07:43] which IP?
[16:07:46] * Coren fixes /that/
[16:08:02] I'm going to leave you to this but will be interested in a postmortem
[16:08:06] Both neptunium and virt1000 ended up with 208.80.154.19
[16:08:22] Apparently, only virt1000 should have.
[16:09:36] * Coren tries to figure out if puppet did it.
[16:10:35] Coren: a dns class was briefly present on neptunium, then I removed it. Possibly it did something during that time, in which case you won't see a trace of it in puppet
[16:11:06] andrewbogott: That may be it; the class must add the secondary IP in /etc/network/interfaces
[16:11:22] Coren: ok, that would explain that part of the story.
[16:11:40] I removed it and puppet isn't trying to put it back. labs-ns1 should return to normal shortly once the ARP cache pollution times out.
[16:11:46] Great. Thank you!
[16:12:18] OK, next two issues: wikitech and gerrit both won't talk to ldap. I know not why. Which would you prefer?
[16:12:49] gerrit seems to me to be the bigger problem since it blocks fixing anything else. I'll jump on that.
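[Editor's note] The suspected mechanism above -- a puppet class pinning the labs-ns1 service IP as a secondary address -- would look something like the /etc/network/interfaces fragment below. The addresses are taken from the `ip addr` output in the log; the exact stanza puppet wrote is an assumption.

```
# Hypothetical sketch of a puppet-managed interfaces stanza that adds a
# service IP as a secondary address. On neptunium this produced the
# duplicate 208.80.154.19 seen above; only virt1000 should carry it.
auto eth0
iface eth0 inet static
    address 208.80.154.6
    netmask 255.255.255.192
    # service IP for labs-ns1.wikimedia.org (the accidental duplicate):
    up ip addr add 208.80.154.19/32 dev eth0
```

Removing the class stops puppet from re-adding the address, but the already-configured secondary IP (and stale ARP entries pointing at the wrong MAC) linger until cleaned up, which matches the "ARP cache pollution times out" remark.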
[16:14:38] ok
[16:14:52] I suspect that as soon as we figure out where the damn logfiles are it'll be trivial
[16:17:39] too many channels
[16:17:56] Coren: andrewbogott in Gerrit I created a user group named "ops-labs" and added your accounts to it
[16:17:58] hashar_; more than 120? or just confused client?
[16:18:03] that should let you merge in ops/puppet
[16:18:21] Svetlana: enough for me to end up being confused :]
[16:18:40] ok, not 120+, this is good (btw I use quassel and it scales reasonably well)
[16:20:32] andrewbogott: There doesn't seem to /be/ any logging beyond apache's
[16:20:42] yeah :(
[16:21:19] gerrit logs in /var/lib/gerrit2/review_site/logs
[16:21:45] and I have no idea how to make it log / debug ldap messages
[16:23:39] Just for a second opinion… Coren, hashar_, y'all can't log into wikitech, right?
[16:23:49] trying
[16:23:53] * Coren tries
[16:24:23] Indeed.
[16:24:37] yeah broken
[16:24:58] could it be a network / firewall issue?
[16:26:47] Coren: andrewbogott I gotta rush out
[16:27:00] at worth you can cherry pick patches directly on the puppet master
[16:27:37] Verify return code: 21 (unable to verify the first certificate)
[16:27:42] andrewbogott: ^^
[16:28:10] Coren: ok, that shouldn't be a critical, but I can probably fix it anyway :)
[16:28:23] gerrit might dislike
[16:29:03] No worse than self-signed though. :-(
[16:29:20] I am off, might show up later tonight
[16:29:39] if anything urgent, ring my phone :]
[16:29:40] Coren: better?
[16:29:46] hashar_: thanks!
[16:30:18] andrewbogott: Same difference. Lemme first make sure that's actually the problem.
[16:30:26] hm, which cert doesn't it like?
[16:31:04] The globalsign one. Maybe not in its trusted roots or there's a missing link in the chain?
[16:33:27] hm, looks like the wikitech issue is similar.
[16:33:38] I tested this 1000 times on labs, why is it different here?
[16:34:00] Anyway, let me know if you figure out what the complaint is. The full chain should be there...
[16:36:25] afaict, the ldap communication is denied during SASL nego
[16:36:53] And openssl s_client returns the "unable to verify the first certificate" error.
[16:37:27] This is what I get as chain:
[16:37:29] Certificate chain
[16:37:30] 0 s:/C=US/ST=California/L=San Francisco/O=Wikimedia Foundation, Inc./CN=ldap-codfw.wikimedia.org
[16:37:30] i:/C=BE/O=GlobalSign nv-sa/CN=GlobalSign Organization Validation CA - SHA256 - G2
[16:37:50] Coren: I just replaced the ldap-eqiad cert with a chained cert. Does that make a difference?
[16:38:21] Certificate chain
[16:38:22] 0 s:/C=US/ST=California/L=San Francisco/O=Wikimedia Foundation, Inc./CN=ldap-eqiad.wikimedia.org
[16:38:22] i:/C=BE/O=GlobalSign nv-sa/CN=GlobalSign Organization Validation CA - SHA256 - G2
[16:38:40] same
[16:38:44] oh, wait!
[16:38:48] codfw?
[16:39:01] I tried equiad after
[16:39:06] ok
[16:39:10] http://dpaste.com/2023BAQ
[16:39:18] Anyway, that chain is correct, isn't it?
[16:39:35] 'unable to get local issuer certificate'
[16:40:31] "/C=BE/O=GlobalSign nv-sa/CN=GlobalSign Organization Validation CA - SHA256 - G2" doesn't look like a root cert to me.
[16:40:48] There should be an intermediate and a root.
[16:41:22] ldap-eqiad.wikimedia.org, globalsign_ca.pem, globalsign_root_ca.pem
[16:41:26] are all in my truststore
[16:41:31] also wmf-ca
[16:43:04] wikitech should be back up and happy now
[16:43:23] which means tls /can/ work with ldap-eqiad...
[16:43:34] Coren: can you tell what ldap server gerrit is using?
[16:46:52] Wikimedia Labs / deployment-prep (beta): Template:Artwork does not contain templatedata - https://bugzilla.wikimedia.org/71340 (Greg Grossmeier) p:Unprio>Normal
[16:49:17] Coren: looks like gerrit is only hitting codfw.
I'm going to restart it and see if it makes one attempt at eqiad before failing over
[16:49:52] Wikimedia Labs / deployment-prep (beta): Template:Artwork does not contain templatedata - https://bugzilla.wikimedia.org/71340#c1 (Greg Grossmeier) Works now? What'd you do?! ;)
[16:53:04] Per ldapsearch:
[16:53:07] TLS: peer cert untrusted or revoked (0x42)
[16:53:07] TLS: can't connect: (unknown error code).
[16:53:33] Can you paste your commandline?
[16:53:37] Wikimedia Labs / deployment-prep (beta): Template:Artwork does not contain templatedata - https://bugzilla.wikimedia.org/71340#c2 (James Forrester) NEW>RESO/WOR This sounds like it was an artefact of bug 50372.
[16:54:01] ldapsearch -d 1 -H ldaps://ldap-eqiad.wikimedia.org -b ou=people,dc=wikimedia,dc=org '(cn=Coren)'
[16:54:49] So it's definitely an untrusted cert issue.
[16:55:09] that same command works from a labs host
[16:55:33] andrewbogott: The root certs available on the labs host != those on ytterbium
[16:55:37] yep!
[16:55:56] So probably just need to copy some other things into /etc/ssl/certs
[16:55:58] brb
[16:56:12] andrewbogott: Lemme just check for updates to the Ubuntu package
[16:57:35] 20130906ubuntu0.12.04.1 is what we got
[16:58:27] Same as tools.
[16:59:00] So if it works from labs, we installed locally.
[16:59:25] And it does, so we did.
[16:59:36] yep, puppet installs it.
[16:59:49] Needs to be put in base then.
[16:59:49] That's part of the patch I just merged. Probably it missed gerrit due to gerrit using a subset of the ldap classes
[17:00:01] except… I copied it over by hand I think?
[17:00:05] * andrewbogott looks again
[17:00:30] I think there's some magic to do beyond just adding the file.
[17:00:34] hm ok.
[17:00:43] one moment...
[17:01:12] You have to do update-ca-certificates too.
[17:01:34] ok… done :)
[17:01:47] hm, still doesn't work.
[17:01:47] Still missing the cert
[17:01:49] :-(
[17:01:51] But, I will puppetize.
[17:01:56] Any idea what is missing?
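[Editor's note] The "unable to get local issuer certificate" failure being chased above can be reproduced locally with a throwaway CA, which also shows why getting the right CA into the verifier's trust input fixes it. Every name below is invented for the demonstration; no real WMF cert material is involved.

```shell
# Reproduce the chain-verification failure, then fix it by supplying the CA.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Throwaway root CA (stands in for the GlobalSign root in the real incident).
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.pem \
    -subj "/CN=Demo Root CA" -days 1

# Leaf cert signed by that CA (stands in for the ldap server cert).
openssl req -newkey rsa:2048 -nodes -keyout leaf.key -out leaf.csr \
    -subj "/CN=ldap.example"
openssl x509 -req -in leaf.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
    -out leaf.pem -days 1

# Without the CA in the trust input, verification fails the same way
# gerrit's truststore did:
openssl verify leaf.pem || true

# With the CA supplied, the chain verifies:
openssl verify -CAfile ca.pem leaf.pem
```

The final command prints `leaf.pem: OK`; the second-to-last reports the issuer lookup failure. The same idea applies to `openssl s_client`, whose "Verify return code: 21" earlier in the log is the connection-time version of the same missing-issuer problem.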
[17:02:20] Not offhand, but clearly whatever you did in the puppet patch worked in labs.
[17:02:33] So if you do the same on gerrit, it should work. :-)
[17:08:52] Coren: https://gerrit.wikimedia.org/r/#/c/163194/
[17:09:17] That needs a close reading, and also keep in mind that I continue to largely not understand how certs work, so please doublecheck and make sure I'm not leaking something secure/imporant anyplace.
[17:10:05] Coren: and also… hashar was able to merge a patch despite broken gerrit, I'm not sure what he did :/
[17:11:16] having an odd issue after updating mw-vagrant, it keeps failing provising when starting the job runner. I cant seem to find any logs about this, if i `tail -f $(find /var/log -type f) $(find /vagrant/logs -type f` in one shell and `start jobrunner` in another shell, no logs are generated
[17:11:21] so not sure where to debug
[17:11:46] the `start jobrunner` command just outputs: start: Job failed to strt
[17:12:15] this is after a fresh destroy && up, although with a retained mediawiki directory
[17:13:20] ebernhardson: It's an hhvm bug I found last night.
[17:13:42] _joe_ is working on a fix; it has been fixed upstream already
[17:14:00] bd808: awsome, i'll just adjust the script to use php5 for now and wait for your fix :)
[17:14:08] or joe's, that is :)
[17:14:17] If you want a local hack, symlink /usr/etc/hhvm to /etc/hhvm
[17:14:27] lol
[17:15:44] <_joe_> bd808: the fix is in the new package
[17:15:56] <_joe_> so apt-get install hhvm solves the problem
[17:16:05] _joe_: Awesome! ebernhardson ^
[17:16:24] hmm, i literally destroy/up during lila's breakfast, should have latest but i'll update and try agani
[17:16:39] <_joe_> ebernhardson: apt-get update may help
[17:17:20] ebernhardson: vagrant only does `apt-get update` once an hour-ish and not alwys on a new provision
[17:17:29] I have a bug open about that somewhere
[17:24:24] YuviPanda: now login works ! yay !
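[Editor's note] The jobrunner provisioning failure being debugged here turns out, a little later in the log, to involve a trailing comma in /etc/jobrunner.json. Strict JSON forbids a comma after the last member of an object, so any strict parser rejects the file. A quick way to check such a config (the file contents below are a made-up stand-in, not the real jobrunner.json):

```shell
# Strict JSON forbids a trailing comma after the last member, so a parser
# like the one in redisJobRunnerService refuses the whole file.
bad='{"redis": {"host": "localhost",}}'    # note the trailing comma
good='{"redis": {"host": "localhost"}}'

echo "$bad"  | python3 -m json.tool >/dev/null 2>&1 || echo "bad: invalid JSON"
echo "$good" | python3 -m json.tool >/dev/null 2>&1 && echo "good: valid JSON"
```

Running `python3 -m json.tool < /etc/jobrunner.json` on the target host would have pinpointed the parse error directly.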
[17:24:30] yay
[17:24:55] actually i think its a trailling comma problem?
[17:25:07] updated and tried installing hhvm, but its already the latest version
[17:25:19] swapped jobrunner to use php5, it now outputs: Exception: Could not parse JSON file '/etc/jobrunner.json'. in /srv/jobrunner/redisJobRunnerService on line 116
[17:25:37] the "redis": { } section has a comma after its last item, which is an error?
[17:27:06] removed it from the template and re-provisioning, will see
[17:29:55] us
[17:30:30] 4
[17:31:17] _joe_: I've got hhvm 3.3.0-20140925+wmf2 installed and it is looking for /usr/etc/hhvm/php.ini. Is there a newer build that I'm missing?
[17:38:18] YuviPanda: integration-dev-trusty, like intergration-dev, is a working copy to experiment with things
[17:38:30] but integration-dev is quite dirty and runs precise, so its useless tom e
[17:38:31] to me
[17:38:43] !log mediawiki-core-team Updated self-hosted puppetmaster on ieg-dev to include ldap renaming patches
[17:40:51] Krinkle: right, but -dev-trusty has puppet failures
[17:41:00] I know, I didn't cause that
[17:41:03] someone broke stuff
[17:41:04] it would be nice for it to not have puppet failures :)
[17:41:09] it's a prestine instance with just cislave applied
[17:41:29] the failures will likely show up on the other instances as well if we set up a new plain integratin-slave100x
[17:43:19] Krinkle: aren't the other instances precise/
[17:44:38] slave 1-3 is precise, slave 6-8 is trusty
[17:44:41] YuviPanda:
[17:44:44] hmm
[17:44:50] https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup
[17:55:27] andrewbogott: There are seven more certs in labs than on ytterbium. Trying to figure out what the diffs are.
[17:55:51] Coren: note that things also work on virt1000. Might be a smaller diff.
[17:57:48] What I don't get is that everything that looks like GlobalSign related is the same.
[18:07:32] andrewbogott: ldap.conf.
Labs refers to /etc/ssl/certs/GlobalSign_CA.pem explicitly and includes /etc/ssl/certs; prod only uses /etc/ssl/certs/ca-certificates.crt
[18:08:37] ca-certificates appears to be constructed, but not out of what's in /etc/ssl/certs
[18:10:55] andrewbogott: Ah-HA!
[18:11:04] Coren: ?
[18:11:22] I just tried copying ca-certificates.crt from virt1000, no change in behavior
[18:11:25] andrewbogott: We've been doing it wrong. Certs we want to trust must go to /usr/local/share/ca-certificates
[18:11:41] Coren: join us in #security and say that again?
[18:26:30] !log mediawiki-core-team Updated puppetmaster on bd808.eqiad.wmflabs to include ldap patches
[19:07:25] !log logstash Updated puppetmaster on logstash-deploy to include ldap patches
[19:07:51] !log
[19:10:20] jeremyb: morebots is AWOL
[19:10:42] labs-morebots: hello
[19:10:42] I am a logbot running on tools-exec-06.
[19:10:42] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[19:10:42] To log a message, type !log .
[19:10:53] Hm.. looks like wikitech lost my reference to openstack again? Any action I perform results in a very confusing error
[19:10:55] !log foo
[19:10:55] Message missing. Nothing logged.
[19:11:00] like "host integratin-dev-trusty does not exist"
[19:11:07] or "There are no instances in project integration"
[19:11:17] !log mediawiki-core-team Updated self-hosted puppetmaster on ieg-dev to include ldap renaming patches
[19:11:17] mediawiki-core-team is not a valid project.
[19:11:23] !log mediawiki-core-team Updated puppetmaster on bd808.eqiad.wmflabs to include ldap patches
[19:11:23] mediawiki-core-team is not a valid project.
[19:11:36] !log logstash Updated puppetmaster on logstash-deploy to include ldap patches
[19:11:37] logstash is not a valid project.
[19:12:02] so now it's back but doesn't know about the projects?
[19:12:55] !log deployment-prep Checking labs-morebots connectivity
[19:13:12] Logging out and logging in again fixes it...
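[Editor's note] The root cause found above can be summarized in a config sketch. The directive names are standard OpenLDAP ldap.conf options; the exact file contents on labs and ytterbium are reconstructed from the discussion and are an assumption.

```
# labs /etc/ldap/ldap.conf (works -- the new cert is trusted explicitly):
TLS_CACERT    /etc/ssl/certs/GlobalSign_CA.pem

# prod (ytterbium) -- only the generated bundle is consulted:
TLS_CACERT    /etc/ssl/certs/ca-certificates.crt

# ca-certificates.crt is rebuilt by update-ca-certificates from the distro
# roots plus /usr/local/share/ca-certificates/*.crt -- NOT from loose files
# dropped into /etc/ssl/certs. So the fix is (run as root):
#   cp GlobalSign_CA.pem /usr/local/share/ca-certificates/GlobalSign_CA.crt
#   update-ca-certificates
```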
[19:13:50] Krinkle: I had to do that today too. I think it happens when memcached is cleared on the server
[19:14:03] Great story.
[19:14:05] :)
[19:14:26] Where is the other part of the session?
[19:14:32] Because it doesn't log the wiki account out
[19:15:22] That part has always been a mystery to me. But wikitech is a strange wiki
[19:16:14] It probably has something to do with ldap voodoo but that's a random guess like the one before
[19:17:55] !log deployment-prep Checking labs-morebots connectivity
[19:17:55] deployment-prep is not a valid project.
[19:18:37] labs-morebots: hello
[19:18:38] I am a logbot running on tools-exec-06.
[19:18:38] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[19:18:38] To log a message, type !log .
[19:18:52] !log deployment-prep Checking labs-morebots connectivity
[19:19:09] * bd808 gives up on the bot
[19:34:24] Coren: andrewbogott: has the LDAP issue been solved? I would like to clean up the Gerrit credentials hack I made earlier today :]
[19:34:38] hashar: yes, resolved.
[19:37:51] andrewbogott: Coren I am leaving the Gerrit group ops-labs
[19:37:58] let folks easily add you as reviewers :D
[19:38:03] ok
[19:39:02] andrewbogott: you should be able to add / remove members at https://gerrit.wikimedia.org/r/#/admin/groups/855,members
[19:53:11] bd808, marxarelli : good news flow-tests alive and working, `sudo chmod -R o+rX /srv/vagrant` from https://bugzilla.wikimedia.org/show_bug.cgi?id=70959#c12
[19:53:19] did the trick
[20:17:04] spagewmf: glad to hear it!
[20:18:05] HAL, `sudo labs-provision virgin-state`. "I can't do that Dave" :)
[20:58:39] <_joe_> bd808: I'm pretty sure it worked on the hhvm jobrunner in prod
[20:58:58] * bd808 tests again
[21:02:34] * bd808 waits for host he's on at the moment to update ....
[21:05:39] <_joe_> bd808: I can confirm the jobrunner is using the correct php.ini and creating the sql3 repo in the correct place, searching the extensions in the right place, and so on
[21:05:47] _joe_: Ah ha. I see the difference. If I invoke hhvm via the "php" symlink (as jobrunner does) it loads the right files
[21:06:02] If I invoke it as hhvm it still looks a the old path
[21:06:29] strace php hello.php 2>&1 | grep '/etc/hhvm' # /etc/hhvm/php.ini
[21:06:51] strace hhvm hello.php 2>&1 | grep '/etc/hhvm' # /usr/etc/hhvm/php.ini
[21:08:15] <_joe_> wtf?
[21:08:36] <_joe_> oh zend_compat
[21:08:41] <_joe_> maybe
[21:08:50] must be.
[21:08:52] <_joe_> ok, we need one moar patch I guess
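[Editor's note] The strace comparison above shows a binary whose config path depends on the name it was invoked under (argv[0]): hhvm run as `php` read /etc/hhvm/php.ini, while run as `hhvm` it still read /usr/etc/hhvm/php.ini. The tiny script below is a made-up demonstration of that mechanism, not hhvm's actual logic:

```shell
# Demonstrate argv[0]-dependent behavior: the same executable answers
# differently depending on the name it is invoked under.
tmp=$(mktemp -d)
cat > "$tmp/hhvm" <<'EOF'
#!/bin/sh
# Hypothetical stand-in for hhvm's config lookup.
case "$(basename "$0")" in
    php) echo "/etc/hhvm/php.ini" ;;       # compat name -> fixed path
    *)   echo "/usr/etc/hhvm/php.ini" ;;   # native name -> old path
esac
EOF
chmod +x "$tmp/hhvm"
ln -s "$tmp/hhvm" "$tmp/php"   # same file, second name

"$tmp/hhvm"   # prints /usr/etc/hhvm/php.ini
"$tmp/php"    # prints /etc/hhvm/php.ini
```

This matches the observed split exactly, and explains why `_joe_`'s fix worked for the jobrunner (which calls `php`) while a direct `hhvm` invocation still hit the old path.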