[00:51:11] !log bots Re-enabled huggle-pg for wm-bot [00:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Bots/SAL, Master [00:52:42] 06Labs, 13Patch-For-Review: Labs instances failing with "internal error: No PCI buses available" - https://phabricator.wikimedia.org/T137857#2432241 (10Matthewrbowker) [00:52:44] 06Labs, 10WM-Bot: wm-bot is not responding to any messages - https://phabricator.wikimedia.org/T139264#2432239 (10Matthewrbowker) 05Open>03Resolved I have re-enabled huggle-pg per the above. Bot is now operational. [00:52:57] wm-bot: yo [00:52:57] Hi mutante, there is some error, I am a stupid bot and I am not intelligent enough to hold a conversation with you :-) [00:53:27] well that is still good, its not "not responding" [00:53:40] Indeed. [00:53:45] That was a saga though xD [00:55:23] :) [00:55:30] thanks [00:55:42] You're welcome. [01:46:52] 06Labs: promethium.wikitextexp.eqiad.wmflabs (10.68.16.2, labs baremetal host) has strange DNS A record result, and missing PTR - https://phabricator.wikimedia.org/T139438#2432317 (10AlexMonk-WMF) [02:17:48] 06Labs, 10Tool-Labs, 13Patch-For-Review: Install inkscape - https://phabricator.wikimedia.org/T126933#2432342 (10Rschen7754) This would still be helpful so that other operators could run the bot. [07:33:55] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2432508 (10Pokefan95) [07:45:06] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2432527 (10Paladox) phab-04 and phab-03 have been deleted as a cleanup, we now have phab-02 (run on phab-05) and phab-01. [07:58:18] 06Labs, 10Phabricator: Upgrade phab-01.wmflabs.org - https://phabricator.wikimedia.org/T127617#2432533 (10Paladox) [07:58:22] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2432531 (10Paladox) 05Open>03Resolved I fixed it. The problem was do to us adding role::phabricator::labs to phab-01 and didn't remove it causing it to add a file that was really meant for prod... [08:05:26] !log phabricator Im logging that i fix T139444 by removing role role::phabricator::labs that we doint need for phab-01 [08:05:27] T139444: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444 [08:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Phabricator/SAL, Master [09:00:37] 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling, 13Patch-For-Review: Nodepool can not delete/spawn instances anymore - https://phabricator.wikimedia.org/T139285#2432601 (10hashar) 05Open>03Resolved It is solved. What @paladox noticed yesterday was the pool of instances being exhausted a... [09:13:08] (03CR) 10Lokal Profil: "How easy would it be to use mustache or similar with Intuition?" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297193 (owner: 10Lokal Profil) [10:08:52] (03CR) 10Lokal Profil: [C: 04-1] "Asking the question which shouldn't be asked... what does this do and does anyone use it?" (039 comments) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [10:19:38] (03CR) 10Lokal Profil: [C: 032] "The url simply generates an empty page." 
[labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297116 (https://phabricator.wikimedia.org/T138513) (owner: 10Jean-Frédéric) [10:20:37] (03Merged) 10jenkins-bot: Use Wikidata item in API output formats [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297116 (https://phabricator.wikimedia.org/T138513) (owner: 10Jean-Frédéric) [11:14:26] (03CR) 10Lokal Profil: [C: 04-1] "Similar to the last one. What does the tool do and does anyone use it? That said it's broken now so if this fixes it then why not =)" (031 comment) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297499 (owner: 10Jean-Frédéric) [11:16:56] !log tools.heritage Deployed latest from Git: 3323de1 and 6d20267 (T138513) [11:16:57] T138513: Use wd_item in output formats - https://phabricator.wikimedia.org/T138513 [11:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master [11:17:30] * yuvipanda pokes JeanFred [11:18:11] Hi yuvipanda [11:18:36] hi jeanfred [11:19:21] I wanna move heritage to the kubernetes backend. This will entail no behavioral differences for your workflow, except php5.6 (which doesn't have that many breaking changes from 5.5) and better resource isolation. I emailed to labs-l earlier... https://lists.wikimedia.org/pipermail/labs-l/2016-June/004534.html [11:20:07] do you have objections? Moving it now would provide a good time for us to test it all out before WLM rolls out, and at that time we can do better things to support it (like run multiple instances of it, have proper health checks, etc) [11:20:35] Sounds good to me :) [11:20:54] ok! [11:20:55] doing it now [11:20:59] Ok ! [11:21:14] yuvipanda: just wondering, when will k8s be able to run python-uwsgi stuffs? [11:22:05] zhuyifei1999_ probably in a week or so. it's a bit more involved because the virtualenv should be created in debian jessie and run in debian jessie [11:22:16] so I need to have a way to get people shell access on a debian jessie setup [11:22:28] this can also happen inside containers and I tried it out yesterday for moving lolrrit-wm [11:22:29] ok [11:22:32] and it works ok [11:22:58] just needs a bit more thought, mostly around how to expose it to people [11:23:00] yuvipanda: Can you log to tools.heritage when you go ahead? :) [11:23:04] yup! [11:23:20] !log tools.heritage restarting tool with 'webservice stop' 'webservice --backend=kubernetes start' [11:23:21] Looks like it went down and is up again \o/ [11:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master [11:23:45] JeanFred yup! poke around and let me know if it works ok? [11:24:07] yuvipanda: for the Jessie venvs, a Jessie bastion is the best option ino [11:24:09] Imo [11:24:10] zhuyifei1999_ do you have a specific tool in mind you wanna move? [11:24:19] yuvipanda: :/ https://tools.wmflabs.org/heritage/toolbox/ [11:24:26] yuvipanda: video2commons-test [11:24:52] wait, let me reset to git master first [11:25:08] valhallasw: right, I don't like it because then I've to maintain list of packages in two places - puppet and containers, and then there's no isolation as well - things would work in your bastion because it has php and python and java all installed while your runtime environment only has php or python or java... [11:25:16] anyway, I'll play with it and write up a proposal. [11:25:18] Not sure where that PHP error comes from [11:25:25] JeanFred uh... 
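The backend switch yuvipanda logs above comes down to a pair of webservice commands run as the tool account. A minimal sketch, assuming the standard `become` helper and that `webservice status` is available for the final check (only the stop/start invocations appear verbatim in the log):

```
# Switch an existing Tool Labs webservice from grid engine to the
# kubernetes backend (sketch of the tools.heritage move logged above)
become heritage
webservice stop
webservice --backend=kubernetes start
webservice status    # assumption: confirm the service came back up
```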
[11:25:38] looking at code now [11:25:43] Might be unrelated − I had just deployed latest changes from Lokal Profil [11:25:50] yuvipanda: done [11:26:01] right [11:26:10] JeanFred looks like an intuition thing... [11:26:14] yeah [11:26:25] It probably i unrelated [11:26:30] ok! [11:27:23] yuvipanda: mmm. That's true. Let's try your way and see if it can handle the practical use cases [11:27:34] yuvipanda: feel free to break it as much as you want :P [11:27:59] valhallasw yeah. I tried it out for lolrrit-wm yesterday and it worked ok. need to figure out appropriate UX for it first [11:28:02] zhuyifei1999_ ok! :D [11:38:17] (03PS2) 10Jean-Frédéric: Attempt to fix glaring issues with layar server [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297499 [11:40:07] (03CR) 10Jean-Frédéric: "> Similar to the last one. What does the tool do and does anyone use it?" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297499 (owner: 10Jean-Frédéric) [11:47:57] !log tools moved tools-checker-0[12] to use tools-puppetmaster-01 as puppetmaster so they get appropriate CA for use when talking to kubernetes API [11:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [12:33:42] 06Labs, 07Puppet: Puppet failing on labtest* due to slice_network_constants() - https://phabricator.wikimedia.org/T139387#2432927 (10Andrew) 05Open>03Resolved Seems better with that patch. Thanks! [12:39:19] hey andrewbogott [12:39:20] around? [12:39:46] yuvipanda: not fully awake yet, but yes [12:40:59] andrewbogott I'll wait for you to fully wake then :) [12:41:35] it's ok — what's up? [12:45:32] andrewbogott I'm wondering if we want to keep the toolschecker check for labs puppetmaster [12:46:00] what does it check? [12:46:02] it's flaring now, since I moved tools-checker to the tools-puppetmaster-01, and this makes it hard for toolschecker to hit the labs puppetmaster... [12:46:08] it checks if you can get a catalog back [12:46:15] from labs puppetmaster [12:46:23] andrewbogott: I also need to reboot silver and californium for the trusty kernel security update, ok to do that now with some headsup in -operations? [12:47:44] moritzm: That seems like something that is best put on the deployment calendar so that no one is in a panicked documentation search during the outage [12:48:33] yuvipanda: I don't understand, why is the check on tools-puppetmaster-01? Isn't there a dedicated VM to host those tests? [12:48:44] no no [12:48:45] (In general I think that's a useful test) [12:48:58] oooooh [12:49:01] I see what you mean [12:49:13] so I started adding a check on tools-checker to chceck the k8s backend for webservice [12:49:28] but that requires that it use tools-puppetmaster-01 as puppetmaster [12:50:11] but there's also a existing check for labs puppetmaster that uses the puppet client certificate [12:50:39] and that's failing now since the puppet client certificate [12:50:47] is for tools-puppetmaster-01 and not for labs puppetmaster [12:50:54] hmm, ok. will do [12:50:54] well, plus, it's not that useful to have that check test the tools puppetmaster [12:50:59] moritzm: thanks [12:51:15] yeah [12:51:32] yuvipanda: So… I think it's an important test. Sounds like we just need a two different checker VMs [12:51:37] hmm [12:51:52] (which is probably a good idea anyway, I don't love having the tests run on something with a custom puppetmaster — lots of extra variables there.) 
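The toolschecker question above is really "can this instance still get a catalog from its puppetmaster". A quick manual check from the instance itself, assuming shell access; `--noop` fetches and evaluates a catalog without applying anything:

```
# Ask the configured puppetmaster for a catalog without applying it
sudo puppet agent --test --noop
# Confirm which puppetmaster the agent is pointed at, useful after a
# switch like the tools-checker move to tools-puppetmaster-01
grep -E '^\s*server' /etc/puppet/puppet.conf
```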
[12:52:00] (03PS2) 10Jean-Frédéric: Fix ID dump process and tools [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 [12:52:15] andrewbogott hmm, I see. [12:52:48] (03CR) 10Jean-Frédéric: [C: 031] "Some adressed" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [12:52:53] yuvipanda: There's nowhere else that we check that puppetmaster, is there? [12:53:05] (03CR) 10Jean-Frédéric: "Comments" (035 comments) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [12:53:07] andrewbogott we can check it on the puppetmaster itself... [12:53:12] since that's in prod [12:53:13] so you can nrpe [12:53:13] Since a puppetmaster failure could result in nag-emails being sent to a ton of people… it seems important that we hear it first. [12:53:16] and we already do have that [12:53:23] we don't have it paging though [12:54:35] I can probably workaround though - we have a *.tools.wmflabs.org certificate, and I can make the k8s master use that. [12:54:49] this also removes the need for things using the k8s master be on tools-puppetmaster, which is a good thing anyway [12:54:53] just gonna be a bit of work... [12:55:54] I don't totally follow why the k8s monitoring had to be on tools-puppetmaster… but yeah, if you can stand it I think that's better [12:56:11] andrewbogott because we have no other way for us to have paging checks other than tools-checker [12:56:37] but yeah, ok, I'll do the switchover instead to use *.tools.wmflabs.org certificate [12:56:50] thanks, sorry [12:57:02] np, I needed to do that at some point anyway [13:00:13] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Move k8s master to k8s-master.tools.wmflabs.org - https://phabricator.wikimedia.org/T139461#2433023 (10yuvipanda) [13:09:08] good morning [13:09:13] !log tools associated a floating IP with tools-k8s-master-01 for T139461 [13:09:14] T139461: Move k8s master to k8s-master.tools.wmflabs.org - https://phabricator.wikimedia.org/T139461 [13:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:10:02] andrewbogott: so yesterday was apparently all about libvirt1011 being the sole host in the scheduler pool and when it lost network that caused CI to no more spawn instance since no libvirt were left remaining :) [13:10:25] andrewbogott: you poked me at 21:00 etc yesterday stating other libvirt got added back and that definitely match when CI went back :} kudos on the fix! [13:14:39] andrewbogott: FYI, added to the Deployments calendar for 14 UTC today [13:14:51] doesn't conflict with any other deployments [13:14:57] moritzm: works for me! Thanks [14:06:11] yuvipanda: what would it take to have 2 physical servers in lab for a maps-beta cluster? akosiaris told me that we now have a proper way of doing that... [14:06:57] gehel I'm going to redirect you to chasemp :) there's technically a way but I don't know if us labs-ops have bandwidth to support / what support is needed / 'details'. I suggest filing a bug and then poking chasemp. [14:07:53] T138352 is filled... chasemp if you have a minute, let me know. I can probably do at least some of the work. But I am lost in what needs to be done [14:07:54] T138352: Enable maps beta service - https://phabricator.wikimedia.org/T138352 [14:08:09] hopefully not very much [14:08:34] although I do remember some discussions about how much time it takes [14:08:58] with gehel's help it might be possible to do it quickly. 
I 'll help as well as much as I can [14:10:07] I am tied up atm but will respond in a bit [14:10:17] chasemp: take your time, no rush [14:13:35] (03CR) 10Lokal Profil: [C: 032] Attempt to fix glaring issues with layar server [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297499 (owner: 10Jean-Frédéric) [14:14:39] (03Merged) 10jenkins-bot: Attempt to fix glaring issues with layar server [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297499 (owner: 10Jean-Frédéric) [14:28:18] !log tools.heritage Deployed latest from Git: 6e6cc59 [14:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master [14:50:16] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2433245 (10Dzahn) So which role does phab-01 use instead? And a labs role installed a file that was meant for production? [14:53:32] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2433248 (10Paladox) @dzahn when I re set up phab-01 it seems I left role::phabricator::labs on it. So I unticked the box now and shoulden be applied. [15:04:53] 06Labs, 10Tool-Labs, 10puppet-compiler: toolsbeta: set up puppet-compiler / temporary-apply - https://phabricator.wikimedia.org/T97081#2433271 (10valhallasw) using whatever was in /var/lib/git/ops/puppet originally as 'old' and the current production branch as 'current': ``` valhallasw@toolsbeta-puppetmaste... [15:10:24] Yuvipanda I got puppet compiler to work! [15:10:38] Well, parts of it at least ;-) [15:13:40] valhallasw`cloud (IRC): \o/ nice! [15:13:56] In the middle of juggling T139461 I'll check back in a bit [15:13:56] T139461: Move k8s master to k8s-master.tools.wmflabs.org - https://phabricator.wikimedia.org/T139461 [15:24:09] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2433344 (10Dzahn) @Paladox, ok, yea, but does that mean it does not use any role at all now? [15:25:34] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2433365 (10Paladox) The role never worked due to dns not working in labs (carn't find dns phab-tin for scap) I installed everything manually. [15:28:26] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2433370 (10Dzahn) Ok, so neither the "production" nor the "labs" role can be applied to any instance. I think the usefulness of these instances is limited if they are all manual and don't have much... [15:33:57] 06Labs, 10Phabricator: phab-01 and phab-03 to 04 returns a 502 error - https://phabricator.wikimedia.org/T139444#2433436 (10Dzahn) created T139475 [15:34:37] 06Labs, 10Phabricator: Upgrade phab-01.wmflabs.org - https://phabricator.wikimedia.org/T127617#2048396 (10Dzahn) I think T139475 is important. Instances that are all setup manually have limited usefulness for testing any production change. 
[15:43:47] (03PS3) 10Lokal Profil: Fix ID dump process and tools [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [15:43:55] (03CR) 10Lokal Profil: Fix ID dump process and tools (031 comment) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [15:44:08] (03CR) 10Lokal Profil: Fix ID dump process and tools (033 comments) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [15:44:53] (03CR) 10Lokal Profil: "The last patch should have fixed my comments. Still can't check locally though due to sql dump." [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [15:46:51] (03PS1) 10Lokal Profil: Remove old sql dump before downloading new [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297606 [15:51:30] (03PS1) 10Lokal Profil: Run monument_tables.py from sub-directory [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 [16:05:50] I keep getting 'The connection to the server tools-k8s-master-01.tools.eqiad.wmflabs:6443 was refused - did you specify the right host or port?' every so often when I try to run any kubectl command, and then after a few minutes it works again. Is work being done on the k8s master or something? (pinging yuvipanda) [16:06:08] I'm actively futzing with it rigt now, tom29739 [16:06:22] this shouldn't affect running tools [16:06:47] but will affect kubectl operations [16:07:00] it's for T139461 [16:07:01] T139461: Move k8s master to k8s-master.tools.wmflabs.org - https://phabricator.wikimedia.org/T139461 [16:07:13] Ah, OK. [16:09:18] (03CR) 10Jean-Frédéric: "Good catch." [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297606 (owner: 10Lokal Profil) [16:10:48] 06Labs, 10Phabricator: Upgrade phab-01.wmflabs.org - https://phabricator.wikimedia.org/T127617#2433519 (10Luke081515) Actually that puppet role is not really working, so we need to change that first. [16:13:53] (03CR) 10Jean-Frédéric: "Hmmmm ; wouldn’t it be more robust to have the Python look for the sql files relatively to the file position ? Something like join(dirname" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 (owner: 10Lokal Profil) [16:21:07] tom29739 it should be all good now. try? [16:21:36] Seems to work. [16:22:00] (03CR) 10Lokal Profil: "That could work. Not clear from the documentation though if it overwrites if a newer is found or simply downloads as [..].gz.1" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297606 (owner: 10Lokal Profil) [16:27:08] (03CR) 10Lokal Profil: "something like" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 (owner: 10Lokal Profil) [16:27:53] (03CR) 10Lokal Profil: "well with the missing quotes of course" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 (owner: 10Lokal Profil) [16:33:37] (03CR) 10Jean-Frédéric: "Exactly what I meant yes :)" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 (owner: 10Lokal Profil) [16:57:06] 06Labs, 10Phabricator: Upgrade phab-01.wmflabs.org - https://phabricator.wikimedia.org/T127617#2433658 (10Dzahn) Exactly, that is the entire point of that ticket, have a working puppet role. 
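The intermittent "connection refused" tom29739 reports above can be separated from a client-side misconfiguration with two quick checks. A sketch; the unauthenticated /healthz probe and the exact server URL are assumptions built from the host and port quoted in the error:

```
# Which API server is kubectl configured to talk to?
kubectl config view | grep server
# Is the API server process answering at all during the maintenance?
curl -k https://tools-k8s-master-01.tools.eqiad.wmflabs:6443/healthz
```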
[17:01:16] (03PS2) 10Lokal Profil: Remove old sql dump before downloading new [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297606 [17:10:59] (03PS2) 10Lokal Profil: Make monument_tables.py aware of file -directory [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 [17:22:38] andrewbogott: with baremetal-labs, the DHCP server is nova and not carbon. is that right? [17:23:14] then can i just delete this? https://gerrit.wikimedia.org/r/#/c/297534/5 [17:23:35] tom29739 ok, my maintenance is complete. you should be good to go [17:23:49] was wondering because there is no MAC address [17:23:50] can I ask what are you playing with kubectl for? [17:26:57] mutante: I'm not sure, I thought it used carbon for the initial OS install [17:27:11] Not sure how it worked with the typo though [17:29:45] mutante: if you 'git show 66505b3fc' you will see that that setting was reasonable, before that patch was applied. So I'd advise reverting to the state before that patch. [17:29:48] Unless I'm missing something [17:31:17] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Move k8s master to k8s-master.tools.wmflabs.org - https://phabricator.wikimedia.org/T139461#2433853 (10yuvipanda) 05Open>03Resolved Done! [17:31:33] andrewbogott: so.. there is the wiki page you wrote https://wikitech.wikimedia.org/wiki/Labs_Baremetal_Lifecycle [17:31:47] it talks about adding MAC addresses in hiera [17:32:00] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 07Tracking: Goal: Allow using k8s instead of GridEngine as a backend for webservices (tracking) - https://phabricator.wikimedia.org/T129309#2433869 (10yuvipanda) [17:32:02] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Provision a .kube/config file for all tools - https://phabricator.wikimedia.org/T133999#2433866 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done too! [17:32:17] alex says that ". The vlan assigned to the host (labs-instances1-b-eqiad) means DHCP will be handled by nova " [17:32:30] looking at that git show ... [17:32:56] oh [17:33:07] do we have a list of vlans somewhere? [17:33:38] mutante: ok, that sounds right — probably that entry in puppet is not needed then. It might be nice to have there just as documentation but it should be fine to remove. [17:33:44] last week I thought it was in a 'labs-hosts' vlan [17:35:22] Krenair: as far as I know the only complete docs are the dns repo [17:35:43] list of VLANs, somewhat.. clone DNS repo and [17:35:45] dns/templates$ grep "\; 10" 10.in-addr.arpa [17:35:46] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Setup monitoring for kubernetes core components. - https://phabricator.wikimedia.org/T131929#2433878 (10yuvipanda) I'm going to do this via toolschecker... [17:35:53] I replied yesterday actually: https://phabricator.wikimedia.org/T133300#2432288 [17:36:28] andrewbogott: thank you, and that is weird how i broke that myself, i have no idea [17:36:35] let's just remove it [17:36:51] but yea, in that change i removed the MAC by accident [17:37:55] Krenair: https://phabricator.wikimedia.org/P3347 [17:38:58] labs-instances1-b-eqiad is 10.68.16.0/21 which covers 10.68.16.0 to 10.68.23.255 [17:39:05] promethium is 10.68.16.2 [17:39:11] so promethium is in labs-instances1-b-eqiad? 
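The subnet arithmetic above (does 10.68.16.2 sit inside 10.68.16.0/21?) is easy to verify from any host with Python 3; a sketch:

```
# Check whether promethium's address falls in the labs-instances1-b-eqiad range
python3 -c "import ipaddress; print(ipaddress.ip_address('10.68.16.2') in ipaddress.ip_network('10.68.16.0/21'))"
# prints: True
```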
[17:39:44] yea, that matches what Alex said [17:39:51] and seems to make sense that it is an instance [17:39:54] not a controller [17:41:07] so the wiki page says " All servers will use a simple, default partitioning scheme" [17:41:16] that means it has nothing to do with partman, right [17:41:38] then we can also delete that entry in netboot.cfg [17:42:13] (03CR) 10Jean-Frédéric: [C: 032] "Tested locally, looks like it works as expected!" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297606 (owner: 10Lokal Profil) [17:42:14] yuvipanda, I'm trying to get my IRC bot to work on kubernetes. It seems to be going well, but I've been having trouble finding an image that has python on (and that works). My virtualenv doesn't seem to work in the container also. [17:43:07] (03Merged) 10jenkins-bot: Remove old sql dump before downloading new [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297606 (owner: 10Lokal Profil) [17:43:08] what container are you trying to use? [17:43:19] our k8s setup is restricted to only containers from docker-registry.tools.wmflabs.org and I don't have any python containers setup there yet [17:43:41] jessie-toollabs seems to work, and it's got python. [17:43:47] mutante, I also filed the other weird thing I found about that host: https://phabricator.wikimedia.org/T139438 [17:43:49] ah, right. [17:44:14] you need a virtualenv created in the same environment for it to work properly I think, so your trusty created virtualenv might not work [17:45:20] I have a jessie instance, so if I create it on there will it work if I rsync it over or something? [17:45:22] Krenair: yea..ugh, i saw. but that is back to "DNS problems in labs" ? [17:45:38] Krenair: i can only speak from prod DNS repo point of view.. and that does not have project names [17:45:41] mutante, this is a pretty unique DNS problem in labs [17:45:52] threre is promethium.eqiad.wmnet [17:45:53] we have messy PTR records elsewhere and a ticket or two about that [17:46:04] tom29739 you can use kubectl exec to get a shell inside a container [17:46:10] what. [17:46:14] Krenair: messy PTR records..but where [17:46:18] ugh [17:46:26] i am willing to fix anything in prod DNS [17:46:34] but dunno about labs dns [17:47:21] krenair@bastion-01:~$ host promethium [17:47:21] promethium.eqiad.wmflabs has address 10.68.16.2 [17:47:21] Host promethium.eqiad.wmflabs not found: 3(NXDOMAIN) [17:47:22] Host promethium.eqiad.wmflabs not found: 3(NXDOMAIN) [17:47:22] krenair@bastion-01:~$ host promethium.eqiad.wmnet [17:47:22] promethium.eqiad.wmnet has address 10.64.20.12 [17:47:24] I don't even [17:48:34] tom29739 I just added a python2 base container for you https://gerrit.wikimedia.org/r/#/c/297624/ [17:49:11] it's building now [17:50:15] I see only one promethium in rt... Maybe this is the same machine on two different IPs? [17:52:49] Krenair: maybe it has an out-of-date IP assigned in prod dns as well as the labs IP [17:52:50] ? [17:53:08] I don't know and I don't have the access to find out [17:53:18] this does not exist in prod DNS promethium.wikitextexp.eqiad.wmflabs [17:53:23] I do think there are things that need documenting and things that need fixing [17:53:32] this does promethium.eqiad.wmnet. 
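The promethium oddity from T139438 is easiest to pin down by querying both directions explicitly; a sketch using standard dig invocations, with the record names taken from the log:

```
# Forward lookups: the project-qualified labs name vs. the prod wmnet name
dig +short promethium.wikitextexp.eqiad.wmflabs
dig +short promethium.eqiad.wmnet
# Reverse lookup for the labs IP; this is the PTR the task says is missing
dig +short -x 10.68.16.2
```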
[17:53:38] no wmflabs addresses are recognised by prod DNS, mutante [17:54:09] "wikitextexp" does not appear there in any way [17:54:15] Yeah, part of why I told the kiwix guy that I'd try to get him another server is that I want to go through the process again now that I've forgotten how I did it — 'documentation testing' [17:54:26] mutante, wikitextexp does appear in the proper version of the labs address [17:54:42] prod resolution of wmflabs addresses is being discussed in https://phabricator.wikimedia.org/T139011 [17:55:11] how is that prod resolution? [17:55:20] labs-puppetmaster [17:55:29] and labs-instance [17:55:42] labs-puppetmaster is an actual host, not an instance [17:55:54] pretty sure it connects to production puppet [17:56:20] labs-puppetmaster-eqiad is a CNAME for labcontrol1001 [17:56:25] i.e. labcontrol1001.eqiad.wmnet [17:57:06] ok, well [17:57:15] quote "reasonably sure this has never worked in this fashion" [17:57:16] same here [17:57:38] but maybe we can go back to promethium for a second [17:58:05] which DNS entry should it actually have [17:58:25] .eqiad.wmnet or eqiad.wmflabs [17:58:40] eqiad.wmnet or eqiad.wmflabs/wikitextexp.eqiad.wmflabs [17:58:52] do we know that this is one actual server? [17:59:30] we dont know much [18:00:09] (03CR) 10Jean-Frédéric: [C: 032] Make monument_tables.py aware of file -directory [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 (owner: 10Lokal Profil) [18:00:18] I do see that I can't ping promethium.eqiad.wmnet (10.64.20.12) from bast1001.eqiad.wmnet [18:00:34] maybe you can try to get to it mutante [18:01:35] i did. cant ssh. trying mgmt now [18:02:18] (03Merged) 10jenkins-bot: Make monument_tables.py aware of file -directory [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297607 (owner: 10Lokal Profil) [18:02:23] btw, that thing about "never worked in this fashion" also means we never had a phabricator instance installed by puppet ever [18:02:49] suprising how that did not even pop up as ticket [18:04:38] Krenair: it's a real server [18:04:44] yes [18:04:45] Debian GNU/Linux 8 promethium ttyS1 [18:04:45] promethium login: [18:04:49] via mgmt [18:04:55] can you successfully log in? [18:05:43] no, i can not [18:05:49] it's running though [18:06:00] I have to run in a moment, but… why the sudden interest in promethium? [18:06:06] As far as I know subbu is using it and all is well. [18:06:20] can we remove it from netboot.cfg ? [18:06:25] It's a labs instance, which means it has no console login, only ssh using labs/ldap keys [18:06:35] eh, that cant be true [18:06:38] i just connected to mgmt [18:06:40] i am using promethium, yes. [18:06:40] and logged in [18:06:47] Pretty sure it's not an instance andrewbogott [18:07:00] I mean, it's puppetized as though it's a labs instance [18:07:06] Debian GNU/Linux 8 promethium ttyS1 [18:07:07] yes but it's not actually one [18:07:11] right [18:07:15] I mean, it's not a VM [18:07:26] so, depends on what we mean by 'instance' :) [18:07:36] Anyway, gotta go. My original question 'why the sudden interest' remains :) [18:07:43] when it gets installed, where does the OS come from? 
[18:08:23] because there seem to be conflicting DNS entries [18:08:30] Don't remember what spawned interest in it but there are now a few open questions and missing/broken labs DNS records, strange prod DNS records and IPs [18:09:14] so if it's used [18:09:27] and not as "eqad" then we can safely remove that one [18:09:52] still not sure about the install/partitioning [18:10:27] oh right, I remember [18:11:56] it was from a discussion between mutante and hashar in -releng around 22:15 UTC yesterday [18:14:19] * subbu only cares that promethium continues to be operational [18:14:29] ok, you know [18:14:39] leave everything exactly as it is [18:15:18] Jul 05 23:12:45 mutante: the services on gallium are not heading to labs. There is too much to loose [18:15:18] Jul 05 23:13:15 mutante: we still need production services such as backup / l33t debugging skils / and ops being able to shutdown / look at log of CI as needed [18:15:18] Jul 05 23:13:39 mutante: on labs that is not really an option. The grant idea is to migrate the services to real hosts next to labs instances [18:16:34] that's how we got to the subject of baremetal hosts again andrewbogott [18:20:59] while looking around for more information about the setup I found issues [18:21:03] i'll stop spending time on this [18:21:13] besides reverting my own fault [18:31:48] reverted that DNS entry in prod to how it was before. whether it makes sense that this is on carbon i'll leave to the one who sets up the next of these machines [18:32:59] which won't be me then [18:42:43] mutante: the DNS happens via weird magic labs things, but the actual os install is a totally normal carbon install [18:42:53] so the partman/whatever info should stay [18:43:27] hey andrewbogott do you have a second to talk about this bare metal stuff? [19:01:00] 06Labs, 06Operations: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2434342 (10Dzahn) failed backups in Icinga again: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labstore1001&service=Last+backup+of+the+maps+filesystem https://icinga.wikimedia.org... [19:01:35] I'm having trouble logging into ores-web-03. [19:01:54] See a paste of -vv for failing to log into web-03 but succeeding at lb-02 here: http://pastebin.ca/3655798 [19:02:13] It seems that my private key is not good that particular machine. [19:03:09] Huh. works now. [19:03:10] I can try quick [19:03:12] Nevermind :/ [19:03:21] Must have been in the middle of a puppet run or something. [19:03:36] halfak: hm let me know if it prompts you for a new key or something randomly that could be a sign of a bigger issue [19:03:39] but otherwise seems ok yeah [19:03:50] It was unavailable for 22 minutes :( [19:04:01] Looks like it wasn't doing what it was supposed to for about that long. [19:04:16] Note the upper right graph here: https://grafana.wikimedia.org/dashboard/db/ores-labs [19:04:23] It's the node that sends "precaching requests" [19:04:42] I'll keep an eye on it. [19:05:54] halfak note that you can get your key added to the root keys list per project [19:06:08] look at 'passwords::root::extra_keys' in https://wikitech.wikimedia.org/wiki/Hiera:Tools [19:06:20] that allows you to log in as root even if LDAP is failing or NFS is failing [19:06:32] otherwise only labs opsen can login as root [19:06:33] Ohhh... Interesting. Thanks! 
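Adding a key to passwords::root::extra_keys on the project's Hiera page only takes effect once puppet runs on the instance. A rough sketch of checking it landed; that the key ends up in root's authorized_keys file is an assumption, not something the log confirms:

```
# On ores-web-03: pull in the new hiera value, then look for the key
sudo puppet agent --test
sudo grep -c halfak /root/.ssh/authorized_keys   # assumption about the target file
```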
[19:06:59] halfak there's some amount of docs on this at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Logging_in_as_root [19:08:44] https://wikitech.wikimedia.org/w/index.php?title=Hiera%3AOres&type=revision&diff=722612&oldid=637850 [19:08:48] +my key [19:14:01] halfak run puppet and then try logging in? [19:46:39] krinkle was looking for easy wins in migrating to k8s, and moved http://tools.wmflabs.org/krinkle-redirect/ [19:46:41] seems to work fine [19:47:08] that's a static page? ;-D [19:47:37] it's a php 404 handler that redirects [19:47:52] https://github.com/Krinkle/ts-krinkle-misc/tree/master/krinkle-redirect [19:48:34] actually, it's a combination of hardcoded 301 redirects [19:48:37] valhallasw`cloud 'easy win' :) [19:48:38] and the rest is 404.php [19:48:45] I moved http://tools.wmflabs.org/gmt/ [19:48:47] yuvipanda: lighttpd works? [19:48:50] krinkle yup [19:48:52] the config [19:48:54] php5.6 [19:48:58] yup [19:49:33] yuvipanda: IP in access log changed from 10.68.21.49 to 192.168.37.0 [19:49:48] right, because they're coming in from k8s [19:49:49] not htat I care [19:49:51] But it's interesting [19:49:55] yeah [19:50:04] if you do kubectl get svc [19:50:06] What's the scope of that? [19:50:13] you'll find the IP your tool has been allocated [19:50:23] and the source IP is just whatever was allocated for the tools-proxy instance I guess [19:50:27] Within what scope does this 192.* IP resolve? [19:50:39] you can hit it from inside containers, or in hosts where flannel and kube-proxy are installed [19:50:44] my tool? toollabs k8s overall, toollabs, labs? [19:51:35] tools k8s overall so far. each tool gets its own 'service' which gets its own IP from a /16 we put out [19:51:56] so the 192.168 addresses are not routable outside tools k8s? [19:52:03] and the access log registers 192.168.37.0, is that related to my tool, or is that the public proxy for all k8s tools? [19:52:18] krinkle-redirect 192.168.0.56 [19:52:24] I think that's just the IP of tools-proxy-01 [19:52:26] as seen by flannel [19:52:36] looking to verify [19:52:59] inet 192.168.37.0/16 scope global flannel.1 [19:53:03] yup [19:54:39] Hm.. can I query service groups on-wiki via SMW? [19:54:50] I can query projects I'm a member of, but haven't figured out how to do it for service groups [19:54:54] e.g. to list tools on my user page [19:54:58] * Krinkle is starting to lose track [19:55:22] Krinkle: ldap is the easiest option [19:55:28] Yeah [19:55:32] But, on-wiki :) [19:55:35] ah. [19:56:02] I don't think so, because service groups don't have an on-wiki page [19:56:20] so there's no place to define members of a service group as far as SMW is concerned [19:56:27] right [19:57:30] krinkle I just login as myself and do 'id' [19:57:51] krinkle anything else that can be moved? Intuition? :) [19:58:20] $ id | tr ',' '\n' | grep tools [19:59:02] yuvipanda: not sure right now [19:59:04] get back later [19:59:09] krinkle ok! [20:12:43] Krenair re: beta and NFS, I'd love to have beta stop using NFS before we switch labstores - it would be great to not have an beta downtime when that happens [20:12:50] so if you think you can push that through, <3 [20:13:59] yuvipanda, Based on https://phabricator.wikimedia.org/T64835#1947921 I think Filippo did most of the hard work and it just needs pushing across the finish line [20:14:23] I don't remember what else we have on NFS in beta beyond upload [20:17:02] krenair nothing else [20:17:07] legoktm I just updated http://tools.wmflabs.org/cdnjs/ [20:18:16] thanks! 
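The per-tool service IPs and the 192.168.x source addresses yuvipanda describes above can be inspected directly; a sketch, assuming access as the tool (for kubectl) and to a host running flannel (for the interface check, which ordinary tool accounts may not have):

```
# List the tool's kubernetes service and the cluster IP allocated to it
kubectl get svc
# On a flannel host such as the proxy: the overlay address traffic arrives from
ip -4 addr show flannel.1
```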
[20:19:34] yw [20:21:48] haven't yet figured out why deployment-cache-upload04 has nfs mounted [20:22:13] should just be mediawiki boxes (+jobrunner, tmh) + deployment-upload, which all have it. but cache-upload04? [20:23:43] krenair I mounted it because I had no idea if it should be mounted or not :) [20:23:47] and it had the word 'upload' in it [20:23:49] aha okay :) [20:23:50] that was just me :) [20:24:07] know any tricks to determine whether it ever accesses files there? [20:24:12] Amir1 hey! I'm going to move the tool ores (tools.wmflabs.org/ores) to kubernetes. No change on your side required for anything, just fyi :) [20:24:24] Krenair there's 'fatrace' but I don't know if works over NFS [20:25:42] we just added/fixed monitoring of ores worker nodes etc, is that also staying the same? [20:25:59] yuvipanda: did I tell you I'm a big fan of moving to kubernates? that sounds exciting to me [20:26:22] Doesn't seem to work with NFS [20:26:37] mutante yup, completely unrelated. [20:26:52] amir1 :) it is just the thing that serves tools.wmflabs.org/ores so not much will change since that's just a redirect [20:27:17] mutante: all of them is in ores in labs project or prod. this tool is just a redirect [20:27:26] yuvipanda: no, I mean in general [20:27:28] cool, ok [20:27:46] amir1 :D right [20:27:50] :) [20:28:40] going off [20:29:18] to eat food and things [20:39:17] yuvipanda: it was one of things that I thought it so cool that I put them in the report to the community: https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%D9%82%D9%87%D9%88%D9%87%E2%80%8C%D8%AE%D8%A7%D9%86%D9%87/%DA%AF%D9%88%D9%86%D8%A7%DA%AF%D9%88%D9%86#.D9.88.DB.8C.DA.A9.DB.8C.E2.80.8C.D9.85.D8.A7.D9.86.DB.8C.D8.A7 [20:39:17] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [20:39:18] D9: Remap all submodules to tin - https://phabricator.wikimedia.org/D9 [20:39:34] facepalm :D [20:40:56] my bot gets fooled by %encodes [20:41:05] I had a bug to fix that somewhere [20:42:09] oh. I fixed %DN but not .DN [20:44:18] 06Labs, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Newsletter, 13Patch-For-Review, 15User-bd808: "vagrant roles" commands fail with some encoding (LC_ALL) settings with "invalid byte sequence in US-ASCII" - https://phabricator.wikimedia.org/T131460#2434825 (10bd808) 05Open>03Resolved a:03bd808 [20:44:32] 06Labs, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Newsletter, 13Patch-For-Review, 15User-bd808: "vagrant roles" commands fail with some encoding (LC_ALL) settings with "invalid byte sequence in US-ASCII" - https://phabricator.wikimedia.org/T131460#2434828 (10bd808) a:05bd808>03Tgr [21:03:30] 06Labs, 10Continuous-Integration-Infrastructure, 10Zuul: role::zuul::configuration should be replaced by hiera - https://phabricator.wikimedia.org/T139527#2434874 (10Paladox) [21:19:25] Is there a place I can put docs, etc. for my tool on Wikitech? [21:21:39] tom29739: sure, on wikitech wiki [21:22:19] OK. [21:22:41] tom29739: is it in toollabs ? then check from https://tools.wmflabs.org/hay/directory/ [21:25:16] I want to create documentation for my tool though. [21:28:39] tom29739: yes, that's great! more people should do that. 
please feel free to just create a page on wikitech wiki [21:28:58] it can always be changed from there [21:29:58] (03CR) 10Jean-Frédéric: Fix ID dump process and tools (031 comment) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/297526 (owner: 10Jean-Frédéric) [21:30:13] tom29739: https://wikitech.wikimedia.org/wiki/Category:Tool_Labs [21:30:36] tom29739: maybe like that in the Help:Tool Labs/Foo format [21:31:03] OK. [21:31:31] bd808: ^ [21:32:21] tom29739: we have the Tool namespace on wikitech and a template too. See https://wikitech.wikimedia.org/wiki/Tool:Stashbot [21:33:03] And https://wikitech.wikimedia.org/wiki/Category:Tool_Labs_tools [21:52:05] sorry, yea, Tool namespace! what he said [22:16:20] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: Drop some Trusty permanent slaves from integration labs project - https://phabricator.wikimedia.org/T139535#2435262 (10hashar) [22:24:35] 06Labs, 06Revision-Scoring-As-A-Service: Delete 'revscoring' project in labs - https://phabricator.wikimedia.org/T139537#2435324 (10Ladsgroup) [22:58:51] 06Labs, 06Revision-Scoring-As-A-Service: Delete 'revscoring' project in labs - https://phabricator.wikimedia.org/T139537#2435514 (10Andrew) 05Open>03Resolved thanks!