[00:02:56] 10MediaWiki-extensions-OpenStackManager, 10Echo, 3Collaboration-Team-Current: Write presentation models for notifications in OpenStackManager - https://phabricator.wikimedia.org/T116853#1760264 (10Catrope) p:5Triage>3High [00:06:33] 6Labs, 10Labs-Infrastructure, 6operations, 7Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760278 (10Dzahn) after this. on labvirt1001, the plugin got created: Notice: /Stage[main]/Openstack::Nova::Compute/File[/usr/local/lib/nagios/plugi... [00:37:01] 6Labs, 10Labs-Infrastructure, 6operations, 7Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760353 (10Dzahn) works now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kvm+ssl [00:37:22] 6Labs, 10Labs-Infrastructure, 6operations, 7Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760354 (10Dzahn) 5Open>3Resolved [02:24:15] #wikimedia-meeting [02:24:52] oops..sorry, just pecking in join commands and forgot the "/join" part [02:53:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [03:59:55] (03PS1) 10Dzahn: add fake new_install key pair for puppet compiler [labs/private] - 10https://gerrit.wikimedia.org/r/249343 [04:01:01] (03CR) 10Dzahn: [C: 032] add fake new_install key pair for puppet compiler [labs/private] - 10https://gerrit.wikimedia.org/r/249343 (owner: 10Dzahn) [04:01:11] (03CR) 10Dzahn: [V: 032] add fake new_install key pair for puppet compiler [labs/private] - 10https://gerrit.wikimedia.org/r/249343 (owner: 10Dzahn) [05:10:37] 10MediaWiki-extensions-OpenStackManager: Description should be a required field on Special:NovaSecurityGroup?action=create in OpenStackManager - https://phabricator.wikimedia.org/T116885#1760898 (10Tgr) 3NEW [05:14:45] 
10MediaWiki-extensions-OpenStackManager: Description should be a required field on Special:NovaSecurityGroup?action=create in OpenStackManager - https://phabricator.wikimedia.org/T116885#1760916 (10Tgr) Seems to be true for a lot of other forms/fields as well, e.g. the "add security group rule" form. [07:26:31] 6Labs, 10Tool-Labs: Make querycache, querycachetwo and querycache_info tables visible on labs dbs - https://phabricator.wikimedia.org/T65782#1761056 (10Krinkle) [09:33:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [09:46:39] PROBLEM - Puppet staleness on tools-k8s-bastion-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [13:07:03] 6Labs, 10Labs-Infrastructure, 6operations, 7Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1761630 (10Andrew) thanks for fixing, sorry my patch was dumb :( [13:30:01] 6Labs, 10Tool-Labs: Install a PHP opcode cache on tool-labs - https://phabricator.wikimedia.org/T116780#1761682 (10daniel) [13:31:40] 6Labs, 10Tool-Labs: Install a PHP opcode cache on tool-labs - https://phabricator.wikimedia.org/T116780#1758200 (10daniel) @valhallasw: ok, thanks... that leaves me without an option for in-process caching for custom objects though. Is there a good way to do that on tool labs? Memcached would also be an option... [13:54:44] 6Labs, 10Labs-Team-Backlog, 3Labs-Sprint-107, 3Labs-Sprint-108, 3Labs-Sprint-109: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1761748 (10coren) This will start as soon as @cmjohnson reaches the datacenter. [14:20:59] Coren: 15h UTC is in 40 mins [14:21:24] * Coren attempts to calculate timezone offsets and fails. [14:24:20] NFS going down for the switch now. 
[15:38:41] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:42] ...thank you for scheduling downtime in icinga... [15:38:50] I have a problem with Labs and addshore said I should come here and ask you folks :) The configure link brings me to a "Manage instances--The specified resource does not exist" page, and reboot brings me to "projectadmin role required in project wikidata-dev--You must be a member of the projectadmin role in project wikidata-dev to perform this action". Additionally I can't connect to my instance over ssh anymore even though I was just able to like 15 minutes ago. [15:38:52] uhm maintenance [15:38:53] frimelle: there is maintenance ongoing that got announced on the labs-l mailing list [15:38:54] should take roughly one more hour from now [15:38:55] Okay, thanks! :) [15:38:56] frimelle: lurk in this channel or on the list and whenever the maintenance is completed notifications will be sent [15:38:56] Awesome! [15:39:10] afaict, NFS is returning. Instances depending on it should start recovering (but it may take several minutes) [15:39:11] I logged in to tools bastion; so that seems to be confirmed. [15:39:13] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 918661 bytes in 8.980 second response time [15:48:32] 6Labs, 3Labs-Sprint-107: Ensure that labstore machine is 'known good' hardware - https://phabricator.wikimedia.org/T106479#1762116 (10coren) [15:48:34] 6Labs, 10Labs-Team-Backlog, 3Labs-Sprint-107, 3Labs-Sprint-108, and 2 others: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1762114 (10coren) 5Open>3Resolved The switch is now complete, and labstore1001 is back to being the primary labs NFS server. Only a single is... [15:57:24] hi there! 
my cron jobs fail with ‘Out of memory’ and ‘Segmentation fault’, but I can manually execute the script without problems [15:57:30] is there a problem with jsub? [15:57:57] ireas: have you provided enough ram for your jobs? [15:58:35] gifti, as far as I remember, I did not configure a RAM requirement. but the jobs are running for months now, and there has never been a problem [15:59:17] Coren: so far, everything looks surprisingly fine :) [16:00:10] ireas: If they are data driven, it's possible that you're just occasionally pushing over the edge. [16:01:06] 6Labs, 6operations, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1762160 (10coren) [16:01:08] 6Labs, 10Labs-Team-Backlog, 3Labs-Sprint-107, 3Labs-Sprint-108, and 2 others: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1762159 (10coren) [16:01:11] 6Labs, 6operations, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1762157 (10coren) 5Open>3Resolved This was confirmed in trial by fire with the switch of labs NFS back to labstore1001. [16:02:42] 6Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1762166 (10coren) [16:02:43] 6Labs, 3Labs-Sprint-107: Ensure that labstore machine is 'known good' hardware - https://phabricator.wikimedia.org/T106479#1762163 (10coren) 5Open>3Resolved a:3coren NFS service has been switched back to labstore1001; and labstore1002's controller is now being swapped out for a new instance. [16:03:47] 6Labs, 3Labs-Sprint-107: Ensure that labstore machine is 'known good' hardware - https://phabricator.wikimedia.org/T106479#1762171 (10coren) [16:04:57] okay, I’ll try to set -mem 500m. 
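The out-of-memory failures above come from grid jobs exceeding their memory allocation, and the fix being tried is passing a larger limit to jsub. A hedged sketch: the `jsub` invocation (and the job name `myjob`, script path) are illustrative of Tool Labs usage and shown only as comments; the runnable part approximates what a memory cap does to an over-allocating job using a plain `ulimit` sandbox.

```shell
# On Tool Labs the fix under discussion would look roughly like this
# (illustrative only; jsub is not available here):
#   jsub -mem 500m -N myjob ~/bin/myscript.sh
# Local approximation: cap virtual memory at ~500 MB and watch a job
# that tries to allocate 1 GB fail, mimicking the "Out of memory" mails.
result=$( ( ulimit -v 512000; python3 -c "x = bytearray(10**9)" ) 2>/dev/null \
          && echo "allocation succeeded" || echo "allocation killed by limit" )
echo "$result"
```

The subshell keeps the `ulimit` from affecting the caller; on the grid the same enforcement happens via the scheduler's h_vmem-style limits rather than a literal `ulimit`.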
let’s see what happens [16:06:55] Interestingly enough, this was the first time we got to actually *prove* that the NFS setup is able to switch masters as designed. It was always conceptually supposed to be the case, but to date every time we tried we had stupid hardware issues. [16:09:33] Coren: I got a bunch of cron emails with random failures like "Out of memory!", segfaults, and bugs in jsub ~20 minutes ago [16:09:45] everything in the past 10 minutes appears to have submitted fine though [16:10:31] legoktm: That's not unexpected - the maintenance is likely to have affected things that tried to start during the window (though likely nothing that was already running or done) [16:10:37] ok [16:18:27] 6Labs, 10Tool-Labs: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T116927#1762218 (10Magnus) 3NEW [16:21:52] 6Labs, 10Tool-Labs: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T116927#1762242 (10coren) a:3coren Investigating this. [16:25:51] 6Labs, 10Wikimedia-Labs-General, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 5Patch-For-Review: Make pagecounts-all-sites available in Wikimedia Labs - https://phabricator.wikimedia.org/T93317#1762268 (10Hydriz) p:5Triage>3Low [16:30:59] 6Labs, 10Tool-Labs: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T116927#1762306 (10valhallasw) I can confirm this. It's not just that file, it seems to be the entire directory: ``` magnus@tools-bastion-01:/data/project/sourcemd$ touch test touch: cannot touch ‘test’: Permi... [16:52:06] 6Labs: Add a second pdns/mysql server - https://phabricator.wikimedia.org/T94865#1762397 (10Andrew) 5Open>3Resolved this is done -- it's called labservices1001. 
[16:52:57] 6Labs, 3Labs-Sprint-108, 3Labs-Sprint-109, 3Labs-Sprint-114, 3labs-sprint-113: Have catchpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1762412 (10Andrew) 5Open>3Resolved [16:52:58] 6Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1762413 (10Andrew) [17:18:17] 6Labs, 10Labs-Infrastructure: live instance migration failing with 'Connection refused' - https://phabricator.wikimedia.org/T116935#1762518 (10Andrew) 3NEW a:3Andrew [17:19:11] Coren: any luck? [17:19:15] anything I can do to help? [17:20:36] YuviPanda: I figured out /what/, I'm trying to figure out the best /how/ atm. labstore1001 doesn't have any of the ldap crap - which was what we wanted - but we weren't quite ready for it for NFS. Faidon's wedge code is our best bet, I'm trying to read up at high speed on how to build a deb for a solib from C. :-) [17:21:07] Coren: wait what do you mean by 'labstore1001 doesn't have any of the ldap crap' - isn't it just the same puppet role? [17:21:55] did we test paravoid's code elsewhere before? [17:22:18] Yeah, it was tested on 1001 by hand. I just don't want to have this be a manual hack. [17:22:39] wasn't the ldap stuff puppetized as well? [17:23:13] Coren: alright. but if we're still without it in say another 30mins I vote we just setup LDAP, assuming that's just puppet tweaks [17:23:27] iirc it's a hiera tweak. [17:23:43] But yeah, I really want to /not/ pollute that box with the ldap crap if it can be avoided. [17:23:47] ok [17:23:51] Coren: let's timebox it to 30mins? [17:24:00] 6Labs, 10Tool-Labs: Static content should be always up, or there should be a place where I can put static content and that place should be always up. - https://phabricator.wikimedia.org/T116917#1762534 (10Rillke) So, in other words you say, I should add multiple mirrors to my scripts loading from my private se... [17:24:44] Hm-mh. 
[17:26:01] 6Labs, 10Tool-Labs: Static content should be always up, or there should be a place where I can put static content and that place should be always up. - https://phabricator.wikimedia.org/T116917#1762549 (10yuvipanda) No, that currently everything is dependent on NFS and we don't have a non-NFS based solution f... [17:31:37] 6Labs, 10Tool-Labs: Static content should be always up, or there should be a place where I can put static content and that place should be always up. - https://phabricator.wikimedia.org/T116917#1762582 (10valhallasw) If half an hour of downtime every now and then is problematic, I would indeed suggest other op... [17:32:45] wait, we're back to a non-puppetized labstore? [17:32:52] why did we migrate to something that wasn't puppetized? [17:32:53] come on [17:33:13] I'm not fully sure what happened. [17:33:35] I'm trying to piece this together too [17:35:44] 10Wikibugs: Allow configuring which type of events are sent to a channel - https://phabricator.wikimedia.org/T116939#1762584 (10yuvipanda) 3NEW [17:36:40] Coren: if the alternative setup needs building a deb, that's really not appropriate right now, just install ldap. [17:37:01] it's not that simple [17:37:02] mark: Yeah; that was my conclusion too. Pushing a patch in 2-3 mins. [17:37:24] you'll need to restart idmapd too right [17:37:30] and well, basically every daemon on the system [17:37:50] because nsswitch.conf gets read by glibc and initialized on process execution [17:37:51] paravoid: No, that's independent. The only thing that'll have to be restarted is nfs-kernel-server and that's pretty much transparent. [17:38:13] how is it independent? [17:38:21] aren't you planning on changing nsswitch.conf? [17:38:32] Yeah, but only nfsd needs to know about the groups. [17:38:45] (Something you established yourself) [17:39:01] what are you planning on doing? [17:39:06] re-adding ldap to nsswitch.conf? 
[17:39:52] paravoid: Reenabling the ldap classes to labstores (it was never moved to the labstore module since we wanted to get rid of it for good but then that got pushed back) [17:40:06] oh so it was just never puppetized to have the LDAP classes? [17:40:15] so, the answer is "yes, among others", right [17:40:17] YuviPanda: It was removed between 1002 and 1001. [17:40:26] paravoid: Yes, "yes among others" [17:40:34] this means that right now every process on the system will have a different view of the system accounts until it gets restarted [17:40:40] (or until the machine is rebooted) [17:40:50] including e.g. sshd [17:41:26] paravoid: I'm not sure I perceive the issue. None of the other processes care about the ldap accounts - they run no processes, have no creds. [17:41:49] paravoid: fwiw, we could even just use the LD_PRELOAD trick and it would just work. [17:42:03] Though I don't want to do that unpuppetized. [17:42:14] "...as we understand it" [17:42:31] paravoid: As tested previously. [17:42:35] the point is that you're ending up with a system that is in a very unique state [17:42:56] paravoid: Yes, which is why I was hoping to build the deb to fix it and deploy via puppet. [17:43:16] So that it wouldn't be. [17:43:17] processes running with an old config and a new config on the filesystem, in simple terms [17:43:32] Well, not with the shared object thunk. [17:43:51] In which case only exactly nfsd has the different config by design. [17:44:04] I'm talking about 19:39 < Coren> paravoid: Reenabling the ldap classes to labstores (it was never moved to the labstore module since we wanted to get rid of it for good but then that got pushed back) [17:44:53] paravoid: Ah, yes, then we'd want a reboot if we wanted to avoid that situation (not the same config for all daemons). It's suboptimal, but it's not required. 
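The change being debated is re-adding `ldap` to the `group:` line of nsswitch.conf so that group resolution on the labstore host can see LDAP groups again. A minimal sketch of that edit, operating on a scratch copy rather than the live file (the file contents below are a simplified stand-in for the real nsswitch.conf, which the puppet ldap client classes manage):

```shell
# Build a scratch copy of a compat-only nsswitch.conf (simplified stand-in).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
passwd: compat
group: compat
shadow: compat
EOF
# Re-enable LDAP for group lookups - roughly what reenabling the ldap
# classes does to the managed file.
sed -i 's/^group:.*/group: compat ldap/' "$tmp"
group_line=$(grep '^group:' "$tmp")
echo "$group_line"
rm -f "$tmp"
```

As the discussion notes, editing the file alone is not enough: each running process read nsswitch.conf once at startup, which is why the restart question dominates the rest of the conversation.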
[17:45:08] so can you explain why did we just migrate production to a system that has manual hacks/tests on it? [17:45:54] paravoid: ... it *doesn't*. That's why it doesn't work. My error was not remembering that the old system was not in place but that the new one was only working during manual tests. [17:46:16] system = way for ldap groups to be known by nfsd [17:46:54] ok, why did we migrate production to a system that is unpuppetized/is not identical to the system we migrated from? [17:47:46] this should have been the case anyway in case of an unplanned failover, but even more so in a carefully planned migration [17:48:08] It is puppetized and running off the same manifest as the previous one. There was a change in the manifest since, however, which is the part which forgetting was the error. [17:48:25] which change? [17:48:55] root@labstore1001:~# grep ^group /etc/nsswitch.conf [17:48:55] group: compat [17:49:04] pretty sure this wasn't the case on labstore1002, right? [17:49:14] so the two systems are not identically configured, correct? [17:49:19] When we moved the labstore config over to a proper module, the ldap config wasn't moved alongside since it was on its way out. That had no effect on labstore1002 since there is nothing in the manifest to *remove* the ldap stuff. [17:49:52] 10Wikibugs: Allow configuring which type of events are sent to a channel - https://phabricator.wikimedia.org/T116939#1762636 (10Legoktm) Is there a channel/project that wants to use this? [17:50:19] 10Wikibugs: Allow configuring which type of events are sent to a channel - https://phabricator.wikimedia.org/T116939#1762643 (10yuvipanda) Yes, @halfak and #wikimedia-ai [17:51:12] And since labstore1001 was pristine from puppet, and could successfully serve files, I considered it working. 
[17:51:44] The only effect of the missing ldap config - since we had removed all the other dependencies - is that supplemental groups aren't handled by nfsd [17:52:29] (More precisely, that supplemental groups past the - iirc - 8th one) [17:53:31] So right now there are two ways to get nfsd to do it again. Reconfigure ldap on the box as it was before the manifest was moved to the labstore module (but put it in the module), or make the change for the solib thunk and put it in the module. [17:54:04] hey, sorry [17:54:09] dropped I guess [17:54:11] 19:49 < paravoid> so the two systems are not identically configured, correct? [17:54:14] 19:53 -!- paravoid [~paravoid@scrooge.tty.gr] has joined #wikimedia-labs [17:54:19] that's the last I saw [17:54:22] well, 19:48 < Coren> It is puppetized and running off the same manifest as the previous one. There was a change in the manifest since, however, which is the part which forgetting was the error. [17:54:28] is the last I saw from others [17:54:33] When we moved the labstore config over to a proper module, the ldap config wasn't moved alongside since it was on its way out. That had no effect on labstore1002 since there is nothing in the manifest to *remove* the ldap stuff. [17:54:36] When we moved the labstore config over to a proper module, the ldap config wasn't moved alongside since it was on its way out. That had no effect on labstore1002 since there is nothing in the manifest to *remove* the ldap stuff. [17:54:43] And since labstore1001 was pristine from puppet, and could successfuly serve files, I considered it working. [17:54:45] The only effect of the missing ldap config - since we had removed all the other dependencies - is that supplemental groups aren't handled by nfsd [17:54:48] (More precisely, that supplemental groups past the - iirc - 8th one) [17:54:51] So right now there are two ways to get nfsd to do it again. 
Reconfigure ldap on the box as it was before the manifest was moved to the labstore module (but put it in the module), or make the change for the solib thunk and put it in the module. [17:55:04] Erm. Ignore the repeats. [17:55:20] in any case [17:55:36] the two systems were not identically configured for all this time as they should have been [17:55:42] that's error (1) [17:55:52] Yes. Nobody's arguing that. [17:56:16] and error (2) is that this was also uncaught during this supposedly carefully planned migration [17:57:41] correct? [17:58:22] (03PS1) 10John F. Lewis: send Mailing lists tickets to #wikimedia-mailman [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/249456 [17:58:48] Except for the snark around "supposedly", yes. While the server was tested to run all the daemons, and to have all the filesystems, and to properly serve files by NFS. There was no test for writing to a file via perms from a supplemental group, which could have caught this. [17:59:18] it's not snark; if this had been done carefully this wouldn't have happened [17:59:59] let's focus on resolving this issue right now [18:00:15] both solutions require a reboot to return to a fully clean state? [18:00:25] no [18:00:53] there are: 1) the experimental new thing that hasn't been puppetized nor tested that won't require a reboot, just an idmapd restart [18:01:11] nfsd* [18:01:33] mountd, but anyway [18:01:59] And it's been tested to work fine; just never debianized. [18:02:20] 2) the fix-puppet-and-reboot which will result in the previously well-known state of labstore1002 [18:02:46] and 3) the fix-puppet-and-dont-reboot which will result in a mismatching runtime/config state that is imho a bad state to be in in the long run [18:03:19] Yep. 
[18:03:23] pick your poison :) [18:03:38] there's the separate issue of labstore1002 having a wonky raid controller [18:03:51] which we should actually test at some point [18:03:54] let's factor that in [18:03:56] mark: chris replaced it already [18:04:08] so if that's the case [18:04:11] should we switch back now? [18:04:20] by "previously well-known state of labstore1002" I meant labstore1001 having an identical configuration to the one labstore1002 used to have [18:04:31] or has? dunno, the machine seems to be offline [18:04:41] because they're both connected [18:04:46] which is yet another issue [18:05:26] yes, let's not turn 1002 back on at all until we're very sure we have other safeguards in place to prevent simultaneous mounting/access [18:05:27] mark: no, 1002 has been disconnected. The H800 refuses scsi reservations, and since your priority was to have zero chance of double connection, disconnecting was the only option. [18:05:42] that's not what we agreed on... [18:06:01] mark: Yes, you said to only keep it connected if I could make absolutely certain it could not be started [18:07:00] in private [18:08:03] so in that case [18:08:10] rebooting labstore1001 with ldap works just as well, right? [18:09:05] should be, yes [18:09:28] how long will that take approximately? [18:09:37] it's still very disappointing to me that we didn't fix T87870 before this migration [18:10:05] especially since it has been open for 9 months now, for 4 of which a workable solution has been posted there, just never debianized/puppetized [18:10:19] and i wasn't aware it was related to this migration at all [18:11:47] mark: It shouldn't have been. That is my error in this - that this was partway done and that the indirect dependency wasn't picked up on. [18:12:14] so, how long will it take to reboot that box? [18:12:26] mark: POST is long-ish; 4 mins or so. 
[18:12:40] mark: once POST is past, starting NFS is 15s [18:12:42] i think we should do that, and have chris stand by in case we need to move to 1002 [18:13:09] Alright, lemme move the old ldap config back into the labstore module. Patch coming in a few. [18:16:34] 10Wikibugs: Allow configuring which type of events are sent to a channel - https://phabricator.wikimedia.org/T116939#1762725 (10Halfak) Hi [18:16:38] 6Labs, 10Tool-Labs: Static content should be always up, or there should be a place where I can put static content and that place should be always up. - https://phabricator.wikimedia.org/T116917#1762726 (10Rillke) > The most obvious one is hosting the javascript on the wiki itself This might be a solution for... [18:18:03] halfak: hi [18:18:10] :D [18:18:12] 6Labs, 10Tool-Labs: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T116927#1762731 (10yuvipanda) To update: This is due to missing LDAP dependencies on the new labstore host. It's being investigated and we have potential solutions coming up. I'll keep this ticket updated. [18:18:23] :P [18:19:35] I'll draft an email to labs-announce ready to send if we need a restart [18:19:58] thanks yuvi [18:21:38] so, wait [18:21:46] this change isn't to labstore1001, it's to both [18:22:06] so I guess this means that labstore1002 wasn't fully puppetized either? [18:22:43] paravoid: Like I said, that bit of puppet manifest was removed in anticipation of the solib thunk being deployed and *that* got delayed [18:23:00] I'm not sure I understand, sorry [18:23:01] ...why would you do that BEFORE doing the actual deb building and other work? [18:23:15] yeah that [18:23:21] also, which commit is that? [18:23:25] the one that removed that bit? 
[18:23:35] paravoid: Lemme check [18:24:28] so you basically removed a piece of puppet code in anticipation of another piece of puppet code that would do something equivalent in function [18:24:51] never cleaned up the local config on the machine that already had that applied, and never authored the new piece of puppet code [18:24:57] Well, as part of the "move from site.pp and role/labsnfs.pp to a proper module" [18:25:10] you're not making it better you know [18:25:25] you just said "as part of an unrelated code cleanup" [18:26:20] Iddb77e6d54ad3b2743e7cb35e864e1467f85baec [18:26:26] 6Labs, 10Tool-Labs: Static content should be always up, or there should be a place where I can put static content and that place should be always up. - https://phabricator.wikimedia.org/T116917#1762756 (10Rillke) >>! In T116917#1762582, @valhallasw wrote: > If half an hour of downtime every now and then is pro... [18:27:01] https://gerrit.wikimedia.org/r/#/c/220618/ [18:27:04] (03CR) 10Legoktm: [C: 032] send Mailing lists tickets to #wikimedia-mailman [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/249456 (owner: 10John F. Lewis) [18:27:25] (03Merged) 10jenkins-bot: send Mailing lists tickets to #wikimedia-mailman [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/249456 (owner: 10John F. Lewis) [18:27:52] legoktm: awesome :) [18:28:49] so is this reenabling of ldap gonna be a noop on 1002 now? [18:28:56] !log tools.wikibugs Updated channels.yaml to: e5e90fdb7faaa2b992321b1facd2799ae25d61e7 send Mailing lists tickets to #wikimedia-mailman [18:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master [18:29:02] hopefully! [18:29:15] 1002 has been running with unmanaged config files on the system for a while [18:29:18] mark: It should be, since that's the very configuration that was not copied over in that patch above. 
[18:29:20] so what we should do I think is [18:29:25] disconnect and boot up 1002 [18:29:34] it's already disconnected apparently [18:29:35] so we can do that [18:29:39] It is. [18:29:42] config drift could've happened [18:29:44] please do that [18:29:45] merge this puppet change, run puppet on 1002 and see if there are any diffs [18:29:48] yeah we need to check this [18:29:50] we even moved around LDAP servers in the meantime... [18:29:50] exactly what I'm thinking [18:29:54] yup [18:29:59] Okay, lemme disable puppet on 1001 in the meantime. [18:30:33] (oh, no, I think the LDAP server move was before the switch to 1002, so nvm) [18:30:53] but damn, a move or anything in the meantime could've totally messed up labstore. [18:31:16] 6Labs, 10Tool-Labs, 5Patch-For-Review: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T116927#1762765 (10yuvipanda) p:5High>3Unbreak! [18:35:00] Ok. 1002 is up [18:36:39] Merge? [18:37:01] if 1001 is puppet disabled, yes [18:37:06] It is. [18:37:21] I'll review/merge [18:40:12] paravoid: Did you start the puppet run or did cron? [18:40:22] I'm running a noop run [18:40:42] so the answer is yes, there was tiny config drift [18:40:47] it was alex's changes to ldap.conf etc. [18:40:53] but they were reverted afterwards [18:40:57] and a comment replaced them [18:41:01] so the changes are mostly whitespace [18:41:12] Apply on 1001 then? [18:41:18] plus some functional changes on "ldaplist", but I don't think this matters [18:41:39] oh also [18:41:39] Nope, only getent() matters [18:41:39] Notice: /Stage[main]/Ldap::Client::Openldap/Package[ldap-utils]/ensure: current_value 2.4.40+dfsg-1, should be 2.4.40+dfsg-1+deb8u1 (noop) [18:41:48] ensure => latest, I suppose [18:42:09] yeah ldaplist shouldn't matter [18:43:01] uhm [18:43:03] no, don't merge [18:43:17] Something else? [18:50:01] so yeah, let's merge on 1001 [18:50:10] as in run the puppet agent [18:50:11] * Coren watches. [18:51:12] Applied cleanly. 
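The noop puppet run above is a drift check: report what puppet would change without changing it, so unmanaged edits on the long-running box show up as diffs. A sketch of the idea; puppet itself is not available here, so the real command stays a comment and the runnable part shows the same comparison with `diff` over two invented config files (the hostnames and contents are purely illustrative):

```shell
# The real drift check on labstore1002 (comment only; needs puppet):
#   puppet agent --test --noop    # report pending diffs without applying them
# The same idea in miniature: managed config vs what is actually on disk.
managed=$(mktemp); on_disk=$(mktemp)
printf 'uri ldap://ldap-a.example.org\nbase dc=example,dc=org\n' > "$managed"
printf 'uri ldap://ldap-b.example.org\nbase dc=example,dc=org\n' > "$on_disk"
if diff -q "$managed" "$on_disk" >/dev/null; then
    drift="no drift"
else
    drift="config drift detected"
fi
echo "$drift"
rm -f "$managed" "$on_disk"
```

In the log the drift found this way turned out to be mostly whitespace and reverted changes, which is exactly the kind of verdict a noop run is for.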
getent reports the users as expected. [18:51:33] In theory, restarting the nfs server should fix the issue. Seems like a wise test before a reboot in case more is needed. [18:51:41] what was the output of puppet? [18:51:52] tell me before rebooting so I can email labs-announce? [18:52:01] and wait on ack before rebooting as well [18:52:43] paravoid: installing the libs, ldap.conf, nss.conf. Nothing surprising, I'm passing a fine tooth comb over it now. [18:52:54] paste the output somewhere? [18:54:28] eff. It got lost to my scrollback after a getent passwd I did to test the change filled it. [18:54:47] You'd *think* 1000 lines was enough. Hang on, it should be on disk. [18:57:19] Gah, no, not retroactively. Packages installed: nscd ldapvi libnss-ldapd ldap-utils nss-updatedb python-pycurl python-ldap libnss-db [18:58:06] ok.. [18:59:57] paravoid: do you want to do more checking or are we ready to move forward? [19:00:02] Also /etc/ldap.conf /etc/ldap.yaml /etc/nsswitch.conf /etc/nscd.conf changed [19:00:46] mark: I'd restart nfs, confirm that fixes the issue. [19:01:05] no, things look good [19:01:31] s/good/like they were on labstore1002/ :) [19:01:44] and yes, what Coren says makes sense [19:01:53] let's find a test case first [19:01:58] something that's broken now [19:01:59] (and codfw now has both admin and ldap users) [19:02:26] YuviPanda: It'll get cleaned up at the same time as the other two [19:02:30] Coren: paravoid mark shall I email labs-announce? [19:02:59] paravoid: The one in the bug makes sense https://phabricator.wikimedia.org/T116927 [19:03:36] paravoid: Although, any service group would do - nobody has them in primary [19:03:55] as you said earlier, it's not about primary or not [19:04:08] it's whether you hit the supplementary gid limit [19:04:28] NFS AUTH_SYS' [19:04:31] Ah, that too. (Though almost every labs user hits it by a large margin) [19:04:54] Same test case is easiest. Should I restart nfs? 
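The "supplementary gid limit" referenced here comes from AUTH_SYS (AUTH_UNIX) RPC credentials, which per the ONC RPC specification carry at most 16 supplementary gids; an NFS client silently truncates the rest unless the server resolves group membership itself (the LDAP-aware nsswitch setup being restored here, or mountd's manage-gids mechanism). A quick local check for whether an account would hit that truncation:

```shell
# AUTH_SYS credentials carry at most 16 supplementary gids, so group-write
# access via the 17th-and-later groups fails unless the server resolves
# groups itself. Count this account's supplementary groups:
n=$(id -G | wc -w)
if [ "$n" -gt 16 ]; then
    verdict="over AUTH_SYS limit: $n groups, extras truncated client-side"
else
    verdict="within AUTH_SYS limit: $n groups"
fi
echo "$verdict"
```

This matches the observation in the log that almost every labs user exceeds the limit by a large margin, since each tool membership adds a service group.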
[19:04:59] reach the limit in the specific use case [19:05:50] can you reproduce magnus' problem? [19:06:12] ... no [19:06:22] (That's what I just tested) [19:07:22] 10Tool-Labs-tools-Other, 6Commons, 6Community-Tech, 10Internet-Archive: Create a new DerivativeFX after the Toolserver shutdown [AOI] - https://phabricator.wikimedia.org/T110409#1762906 (10DannyH) [19:07:43] so find a test case? [19:07:51] I'm looking for one as we speak. [19:07:52] before restarting anything [19:11:51] I'm not finding one - creating one. [19:12:32] * Coren is confused. [19:12:44] I can write to files owned by any of my 50+ supplementary groups. [19:12:58] Yet nfsd was not restarted, by puppet or otherwise. [19:13:17] Last start in dmesg: [Wed Oct 28 15:35:19 2015] NFSD: starting 90-second grace period (net ffffffff818ca000) [19:14:25] perhaps it does pick up those changes? [19:15:08] mark: Well, it looks like it did, but I'm wondering how. [19:15:36] it's in-kernel isn't it [19:16:22] mark: the kernel upcalls to the userland daemon; that's how the solib thunk worked. [19:16:55] But yeah, as far as I can tell the actual issue is no longer there. [19:17:07] nscd! [19:17:30] libnss would talk to nscd, and _it_ was restarted by the change. [19:18:00] So anything that uses getent() would work. [19:19:32] Yep. nscd statistics show cache hits. [19:19:50] 28m 22s server runtime [19:20:22] Everything picked the change up because nscd picked it up. [19:22:27] paravoid: Turns out restarting NFS is unneeded. [19:23:12] i think paravoid's point was that this doesn't guarantee consistency for the entire box [19:24:01] hrm, this is confusing [19:24:06] Well, it does for anything that uses libc to resolve; let's wait to see what he thinks? [19:24:19] Within each process that uses nsswitch.conf, the entire file is read only once. If the file is later changed, the process will continue using the old configuration. 
[19:24:26] that's from nsswitch.conf(5) [19:24:54] * YuviPanda holds off on sending any emails [19:24:59] Sure, but that doesn't take nscd into account. glibc getent*() calls /that/ if it's there, and it was restarted at the puppet run [19:28:09] (sec) [19:28:28] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [19:30:22] related? [19:30:42] unlikely. Checking. [19:32:27] yeah, confirmed [19:32:32] nscd saved us in this case [19:32:44] it's still ugly in the sense that if nscd dies the behavior will be different [19:33:10] re puppet error: not related. Timed out. [19:33:18] (Another run fixed it) [19:33:54] paravoid: I'm thinking: restart nfs as a safety net, but no need to reboot and cause an outage? [19:34:38] I guess, yeah. [19:35:25] doing a quick restart now. [19:35:49] is there a labstore migration planned anytime soon for other reasons? [19:36:20] We may want to switch again to (a) double check puppet compliance and (b) test the new H800 in labstore1002. [19:36:32] paravoid: no [19:36:34] But neither is planned in the short term. [19:37:01] or, you know, to fix that idmapd mess that has been there for almost a year now [19:37:46] Yeah, that too. I don't think mark would begrudge my putting this at the top of my stack. [19:38:18] Well, not idmapd - that hasn't been around in ages - but I know what you mean. [19:38:25] yeah [19:38:46] yep, let's do that as high prio and then plan another switchover to 1002 [19:39:21] Now I need my lunch; it's almost 4 and I haven't eaten since breakfast. 
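The resolution hinges on nscd: although each process reads nsswitch.conf only once, glibc's NSS lookups are forwarded to nscd whenever its socket is present, so restarting nscd (which re-reads nsswitch.conf) changed lookup results for every process without restarting any of them. A sketch of how one would verify this; the daemon-dependent commands are comments only, and the runnable part mimics a getent-style member lookup against a mock group file (the gid and member list are invented for illustration):

```shell
# On the labstore host one would check (comments only; needs nscd running):
#   nscd -g                          # cache statistics, uptime, hit counts
#   getent group tools.sourcemd      # resolved via nscd when it is running
# Self-contained stand-in: a getent-style member lookup from a mock file.
mock=$(mktemp)
echo 'tools.sourcemd:x:52512:magnus,valhallasw' > "$mock"   # gid/members invented
members=$(awk -F: '$1 == "tools.sourcemd" { print $4 }' "$mock")
echo "$members"
rm -f "$mock"
```

The fragility noted in the log remains: if nscd dies, lookups fall back to each process's stale in-memory nsswitch configuration, so the box is only consistent while nscd stays up.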
[19:39:28] be back in a few minutes [19:39:47] (paravoid: btw, nfs restarted and back to full health) [19:40:01] 6Labs, 10Labs-Infrastructure: live instance migration (block migration) failing with 'Connection refused' - https://phabricator.wikimedia.org/T116935#1763129 (10Andrew) [19:40:05] 6Labs, 10Labs-Infrastructure: live instance migration (block migration) failing with 'Connection refused' - https://phabricator.wikimedia.org/T116935#1762518 (10Andrew) 5Open>3Resolved [19:40:21] Coren: do !log in -operations next time :) [19:43:29] RECOVERY - Puppet failure on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [19:45:23] 10Tool-Labs-tools-Other, 6Community-Tech: Improving Magnus' tools (tracking) - https://phabricator.wikimedia.org/T115537#1763158 (10DannyH) [19:52:38] YuviPanda: Notice: Finished catalog run in 21.18 seconds [19:52:42] *laughs maniacally* [19:53:47] (on toolsbeta-bastion) [19:54:44] YuviPanda: where should it be under ops-puppet? [19:54:57] ('/etc/puppet/modules/valhallaw/lib/puppet/provider/package') [19:55:00] valhallasw`cloud: where should what be under ops-puppet? [19:55:02] ah [19:55:11] which module you mean [19:55:15] ya [19:55:17] I don't know. 'apt' maybe? [19:55:21] valhallasw`cloud: or wmflib [19:55:27] or maybe not in a module at all? not sure [19:55:32] depends on how specific it is [19:55:35] it's currently constrained to just toollabs [19:55:40] but that's mostly to not kill prod :-p [19:55:55] valhallasw`cloud: ah heh. put up the patch and we'll see what happens? [19:56:01] sure [19:56:01] just in toollabs module then [19:57:34] ok [20:03:43] 6Labs, 10Tool-Labs, 5Patch-For-Review: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T116927#1763205 (10coren) 5Open>3Resolved This should be fixed as supplemental groups are now fully recognized again. [20:09:45] Coren: paravoid mark so no restarts? [20:09:51] not for now, no [20:09:53] ok [20:09:55] YuviPanda: Nope. 
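The check behind T116927 above (writing to files owned by one of your supplementary groups) can be reproduced from any NSS-backed language. A minimal sketch with hypothetical helper names: the `grp` module resolves group names through glibc's NSS stack, and therefore through nscd when it is running, which is why the LDAP change was picked up without an nfsd restart.

```python
# Sketch (hypothetical helpers): inspect the process's supplementary
# groups and test group-writability of a path. Name resolution via grp
# goes through glibc NSS -- and through nscd when it is present.
import grp
import os
import stat

def supplementary_groups():
    """Return {gid: name} for this process's supplementary groups."""
    names = {}
    for gid in os.getgroups():
        try:
            names[gid] = grp.getgrgid(gid).gr_name
        except KeyError:
            names[gid] = str(gid)  # gid with no entry (e.g. stale LDAP data)
    return names

def group_writable(path, gids):
    """True if `path` belongs to one of `gids` and is group-writable."""
    st = os.stat(path)
    return st.st_gid in gids and bool(st.st_mode & stat.S_IWGRP)
```

Calling `group_writable(path, set(supplementary_groups()))` mirrors the manual test Coren ran against files owned by his 50+ groups.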
[20:10:04] ok [20:10:12] so how do we manually get rid of the LDAP stuff on 2001? [20:10:17] https://gerrit.wikimedia.org/r/#/c/249484/ does the puppet bits [20:10:20] but still needs manual cleanup [20:10:51] YuviPanda: There's no point - the only reason it was not there was testing. When we do the proper fix with the solib, I'll add the required ensure => absent [20:11:02] Coren: it has admin and ldap [20:11:06] so one of those needs cleaning up [20:11:09] Oh, right. [20:11:10] it's puppet failing right now [20:11:17] * Coren ponders. [20:11:32] Better remove admin for now like the others. It's easier to manage if all the labstores are the same. [20:11:36] Lemme do a patch [20:13:15] ok [20:26:28] Coren: ok, so I'm gonna merge that and then undo nslcd.conf, nsswitch.conf and ldap.conf [20:27:00] nscd.conf should also have been changed, but that's not an issue in practice. [20:28:59] what does that mean? [20:40:40] Coren: ^ [20:43:23] ehm, yea, i merged a kind of big looking change to openstack module [20:43:34] but i swear they are almost all just comments and i had a +1 [20:43:46] double-checked [20:44:06] just so you are not wondering when you see a diff [20:44:48] yeah shouldn't affect anything I'm doing but thanks for the headsup mutante [20:45:33] Coren: what did "nscd.conf should also have been changed, but that's not an issue in practice." mean? [20:50:47] hmm trying to figure out how to actually get 'patch' to accept the diffs puppet produced [20:53:22] chasemp: that installing ldap, iirc, comes along with an nscd.conf with slightly different values, but those changes are entirely immaterial. 
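One way around `patch` rejecting the diffs puppet produced is to regenerate a clean unified diff (the format patch(1) expects) from "before" and "after" copies of the file. A sketch with Python's difflib; the nscd.conf path and directive values are illustrative assumptions, not the actual labstore change:

```python
# Sketch: build a unified diff from before/after copies of a config
# file instead of reusing puppet's printed diff. The nscd.conf path
# and the negative-cache values below are illustrative only.
import difflib

def unified(before_lines, after_lines, path="etc/nscd.conf"):
    """Return a unified diff string that patch(1) can consume."""
    return ''.join(difflib.unified_diff(
        before_lines, after_lines,
        fromfile='a/' + path, tofile='b/' + path))

# e.g. the LDAP package shipping a reduced negative cache timeout:
before = ["negative-time-to-live passwd 60\n"]
after = ["negative-time-to-live passwd 20\n"]
print(unified(before, after), end='')
```

The output starts with the usual `--- a/...` / `+++ b/...` header, so `patch -p1` would accept it directly.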
[20:53:55] (IIRC, a reduced negative cache timeout for passwd) [20:54:21] I'm just going to apply that diff by hand [21:01:30] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111, 5Patch-For-Review: Support cold-migration or suspended migration, or something, between labvirt hosts - https://phabricator.wikimedia.org/T109902#1763444 (10Andrew) 5Open>3Resolved Since live migration sort of works now, this can be closed. [21:03:11] paravoid: Coren chasemp ok, clean puppet run on 2001 and it's gotten rid of the ldap stuff now [21:03:22] YuviPanda: Thankeee. [21:03:49] In other news, I think I finally got the hang of diversions in a deb. I should have a tentative package to review tomorrow. [21:04:00] Coren: however, I think we should re-image and even rename labstore2001. even if in some future we have a labs in codfw, labstore2001 is just a backup target and probably shouldn't be an active NFS server [21:04:06] let me file a task [21:05:16] !log ores Deployed ores-wikimedia-config:c27c8f7, ores==0.5.4, wb-vandalism==0.1.5, revscoring==0.6.7 [21:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [21:10:11] 6Labs, 6operations: Cleanup / clarify labstore2001 - https://phabricator.wikimedia.org/T116972#1763469 (10yuvipanda) 3NEW [21:10:17] Coren: chasemp ^ [21:12:32] Coren: can you also file bugs to test the new controller? and I don't know if there's a bug to track having both hosts be connected to the shelf? [21:21:28] There isn't, but they cannot be both running and connected. [21:21:35] At least, not safely. [21:22:13] Coren: right. but let's open a bug to keep track of that? [21:54:59] Hi YuviPanda. Are you around today for a chat with Ellery and I about article rec? [23:29:07] leila: was away, just saw this. too late I guess? [23:29:45] adding you to the call which will start in 1 min, YuviPanda. ;-) [23:29:55] leila: nooo I'm already in a call starting in 1min [23:30:03] ah! 
[23:30:05] (different one) [23:30:07] it's for 30min [23:30:21] np. Dario, Ellery, and I will talk and I'll touch-base with you after that. [23:43:52] leila: my previous meeting finished early so I'm here