[01:13:26] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:48:28] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [02:01:33] 06Labs: 'novaobserver' user missing from labtest - https://phabricator.wikimedia.org/T161071#3120598 (10Andrew) [02:53:48] 10Tool-Labs-tools-Pageviews: Add option to simpy download the data without "visual" output - https://phabricator.wikimedia.org/T154353#3120689 (10MusikAnimal) [02:53:50] 10Tool-Labs-tools-Pageviews: Implement a Node.js backend for each Pageviews app - https://phabricator.wikimedia.org/T157830#3120687 (10MusikAnimal) [04:32:25] 10Tool-Labs-tools-Xtools, 03Community-Tech-Sprint: Add a server-side caching service for the new XTools - https://phabricator.wikimedia.org/T161057#3120751 (10Samwilson) a:03Samwilson [04:44:11] 06Labs, 10Tool-Labs: Add interstitial to wikidata-externalid-url - https://phabricator.wikimedia.org/T160205#3092036 (10Esc3300) Eventually, this will be replaced by MediaWiki functionality. I suppose we could work without this tool in the meantime. [04:44:39] 06Labs, 10Tool-Labs, 10Wikidata: Add interstitial to wikidata-externalid-url - https://phabricator.wikimedia.org/T160205#3120773 (10Esc3300) [04:46:10] 06Labs, 10Tool-Labs, 10Wikidata: Add interstitial to wikidata-externalid-url - https://phabricator.wikimedia.org/T160205#3092036 (10Esc3300) [06:23:14] 10Tool-Labs-tools-Xtools, 03Community-Tech-Sprint: Add a server-side caching service for the new XTools - https://phabricator.wikimedia.org/T161057#3120796 (10Samwilson) Caching is set up, with Redis enabled on the labs install. I've added caching to a couple of places in helpers (namespaces and user IDs). The... 
[06:31:49] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [07:11:49] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [09:04:45] 06Labs, 10Beta-Cluster-Infrastructure, 10media-storage: Rebalance deployment-ms-be01 and deployment-ms-be02 so they run on different labvirt - https://phabricator.wikimedia.org/T161083#3120949 (10hashar) [09:47:54] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Rebalance tools exec nodes with an eye towards CPU usage - https://phabricator.wikimedia.org/T161006#3121037 (10hashar) [09:49:53] 06Labs: Providing index of backlinks table to labs replicas - https://phabricator.wikimedia.org/T159984#3121038 (10Ebraminio) Thank you for the kind explanation. I was able to reduce the troublemaking query an user issued to the tool to this which still have database timeout issue: SELECT pl_title, COUNT(*) FRO... [09:55:43] 06Labs: Providing index of backlinks table to labs replicas - https://phabricator.wikimedia.org/T159984#3121053 (10Ebraminio) (the weird point of the mentioned query on the previous comment is if you remove just one of the pages from the list or replace it with something else, it works just fine) [10:00:05] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3080282 (10hashar) Might well be related to T161006 which suggest the Scheduler prioritize mostly based on RAM usage. So we end up with Nodepool instances spawning mostly... 
[10:01:30] 06Labs, 10Tool-Labs, 10Tools-Kubernetes: `jsub` is not available to Tool labs webservice running on Kubernete - https://phabricator.wikimedia.org/T161089#3121098 (10Tpt) [10:11:23] 06Labs, 10Tool-Labs, 10Tools-Kubernetes: `jsub` is not available to Tool labs webservice running on Kubernete - https://phabricator.wikimedia.org/T161089#3121098 (10zhuyifei1999) > Is it possible to make jsub available to Kubernate containers? That is very unlikely to happen. K8s don't know grid and grid do... [10:15:45] 06Labs, 10Tool-Labs, 10Tools-Kubernetes: `jsub` is not available to Tool labs webservice running on Kubernete - https://phabricator.wikimedia.org/T161089#3121209 (10Tpt) > That is very unlikely to happen. K8s don't know grid and grid doesn't know k8s. If you must submit jobs you can either make the webservi... [10:45:01] 06Labs: Providing index of backlinks table to labs replicas - https://phabricator.wikimedia.org/T159984#3121274 (10jcrespo) What you comment is a known limitation of mariadb, it may send bad plans when there is more than 10 items per 'IN ()' expression, as it uses heuristics to calculate the best index. https://... [10:48:32] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Rebalance tools exec nodes with an eye towards CPU usage - https://phabricator.wikimedia.org/T161006#3121291 (10hashar) I have spawned at 10:41 UTC an instance `integration-c1.integration` (`24fe397e-7bd3-4c12-bde3-3e211c5f2671`) with **32GB of RAM**. It has... [12:08:22] 06Labs: Providing index of backlinks table to labs replicas - https://phabricator.wikimedia.org/T159984#3121610 (10Ebraminio) Thanks for the great explanation and sorry for overloading the file. Just as a side note, unfortunately labs users don't have access to EXPLAIN (T152341) and I tried to do this locally y... 
[12:47:43] 06Labs, 10Labs-Infrastructure, 10DBA: Explore 'Analyze' statement as substitute for Explain - https://phabricator.wikimedia.org/T141095#3121693 (10jcrespo) https://tools.wmflabs.org/tools-info/optimizer.py no longer works, and that is a problem for users wanting to EXPLAIN their queries. [12:51:08] 06Labs: Providing index of backlinks table to labs replicas - https://phabricator.wikimedia.org/T159984#3121695 (10jcrespo) > unfortunately labs users don't have access to EXPLAIN @Ebraminio actually, you do: See the workaround on: https://phabricator.wikimedia.org/T50875 There used to be a tool to do that easi... [13:39:26] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [13:39:30] PROBLEM - Puppet run on tools-static-11 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:39:38] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [13:39:52] PROBLEM - Puppet run on tools-worker-1013 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:40:08] PROBLEM - Puppet run on tools-worker-1023 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [13:40:32] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [13:40:38] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Deprecate precise instances in Labs by 2017-03-31 - https://phabricator.wikimedia.org/T143349#3121794 (10Andrew) [13:40:47] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161109#3121811 (10hashar) [13:43:06] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161107#3121780 (10hashar) [13:43:16] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 55.56% of data above the critical 
threshold [0.0] [13:43:21] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [13:43:55] PROBLEM - Puppet run on tools-worker-1002 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [13:44:19] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [13:44:56] ^ possibly related to a wider DNS blip [13:48:00] !log tools migrating tools-bastion-02 in 15 minutes [13:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [13:59:15] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161107#3121918 (10Reedy) [13:59:18] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161109#3121920 (10Reedy) [14:02:02] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Yeryry was created, changed by Yeryry link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Yeryry edit summary: Created page with "{{Tools Access Request |Justification=Beep boop. |Completed=false |User Name=Yeryry }}" [14:05:24] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161107#3121940 (10Reedy) Yeah, it does look like additional ones are just spawned every couple of weeks The top output is surprisingly linear with time spent... ``` reedy@scanner00:~$... 
[14:06:32] PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44) [14:15:08] RECOVERY - Puppet run on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0] [14:18:16] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [14:18:21] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:18:53] RECOVERY - Puppet run on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [14:19:17] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [14:19:25] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:19:31] RECOVERY - Puppet run on tools-static-11 is OK: OK: Less than 1.00% above the threshold [0.0] [14:19:39] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [14:19:49] RECOVERY - Puppet run on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [14:20:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [14:21:42] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161107#3121998 (10Reedy) I've killed all but one of the scanner processes... Plenty from 2016, and the newest 3 are from January I suspect @dpatrick probably has the best idea of what t... [14:35:52] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161107#3122016 (10hashar) Maybe it is meant to be a short time scan and it ends up taking too long / being blocked in a loop. 
[15:12:41] 06Labs, 10Labs-Infrastructure: Investigate instances with high "steal" CPU - https://phabricator.wikimedia.org/T161118#3122078 (10hashar) [15:14:11] 06Labs, 10Labs-Infrastructure: Investigate instances with high "steal" CPU - https://phabricator.wikimedia.org/T161118#3122097 (10hashar) [15:15:28] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Rebalance tools exec nodes with an eye towards CPU usage - https://phabricator.wikimedia.org/T161006#3122100 (10hashar) Seems Linux kernel in a guest is smart enough to find out instruction execution is being delayed by other instances on the same host (cpu... [15:16:20] 06Labs, 10Labs-Infrastructure: Investigate instances with high "steal" CPU - https://phabricator.wikimedia.org/T161118#3122114 (10hashar) a:05Andrew>03None [15:24:42] RECOVERY - Host tools-bastion-02 is UP: PING OK - Packet loss = 0%, RTA = 2.74 ms [15:26:03] 06Labs: 'novaobserver' user missing from labtest - https://phabricator.wikimedia.org/T161071#3122122 (10MoritzMuehlenhoff) 05Open>03Resolved Fixed, cn=ldap_ops was missing on the LDAP directory on labtestservices (and it defines the permissions for the uid=novaadmin user. I've added the group and now it work... [15:28:27] PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44) [15:56:16] 06Labs, 10Tool-Labs, 10Prod-Kubernetes, 10Tools-Kubernetes, 07kubernetes: Fully document process for building a new version of Kubernetes debs - https://phabricator.wikimedia.org/T161031#3122186 (10akosiaris) I 've updated https://wikitech.wikimedia.org/wiki/Tools_Kubernetes#Building_debian_packages with... [16:14:43] RECOVERY - Host tools-bastion-02 is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms [16:19:50] 06Labs: Providing index of backlinks table to labs replicas - https://phabricator.wikimedia.org/T159984#3122239 (10Ebraminio) Excellent! 
Thanks [16:43:24] 10Tool-Labs-tools-Pageviews: Auto-pagepile breaks encoding - https://phabricator.wikimedia.org/T161124#3122317 (10MusikAnimal) [17:52:09] chasemp: is there an etherpad for our meeting tomorrow already? [17:52:19] marxarelli: nope [17:52:26] dope [17:52:29] start one? :) [17:52:33] :) [17:58:22] chasemp: so just to get some clarity, we're sort of crashing your meeting, right? not leading it? [18:00:15] marxarelli: free form, I have some stuff to talk about but if we spend an hour on what you guys are up to that's cool [18:01:42] chasemp: right on. i doubt we'll need that much of your time but good to know [18:02:04] all rivers lead to the ocean marxarelli :) [18:02:13] that's my philosophy here [18:02:44] is it a known issue when you get "Error: Could not request certificate: getaddrinfo: Name or service not known [18:02:59] mutante: new vm? [18:03:00] when running puppet. it's a fresh instance, just created [18:03:03] yes [18:03:07] can you ssh in? [18:03:10] yes, i can [18:03:11] heh. what's the ocean in this metaphor? grand theory of everything? [18:03:12] like, or looking at console? [18:03:22] i can ssh, but i see this when running puppet agent -tv [18:03:25] huh [18:03:35] that is extra weird as puppet had to be ok to get to that point [18:03:55] mutante: I don't have an answer, maybe andrewbogott or myself could look into it today [18:04:02] i created it, then went to horizon and picked a role, then SSHed to it, ran puppet [18:04:05] Error: Could not request certificate: getaddrinfo: Name or service not known [18:04:10] eh, sorry, this was the first run: [18:04:11] mutante: what project? 
[18:04:15] Info: Creating a new SSL key for dzahn-netmon.monitoring.eqiad.wmflabs [18:04:28] andrewbogott: project: monitoring instance: dzahn-netmon [18:05:02] also, it had to be working because [18:05:11] mutante: you're describing what happens when a project sets up its own puppetmaster on the first puppet run [18:05:16] when i SSHed in the first time after just selecting my role, i already saw the changed MOTD [18:05:20] so, I don't know anything about that project, but that sounds like what's happening [18:05:20] including my role [18:05:35] andrewbogott: ah yeah you're right [18:05:49] mutante: what's the puppetmaster in puppet.conf? [18:06:08] server = labs-puppetmaster-eqiad.wikimedia.org [18:06:24] hm [18:06:30] i wasn't aware or intended to have own master [18:06:30] ok, let me look... [18:06:49] i just used this because godog used it for similar things [18:06:52] thanks [18:07:34] mutante: have you tried running puppet as root? [18:08:01] omg, yes, i am now [18:08:21] andrewbogott: so sorry for bothering [18:08:24] I make that mistake semi-constantly so it's easy to recognize :) [18:08:29] hehe [18:08:31] thanks andrewbogott [18:09:01] * mutante headdesk.. sorry [18:11:10] you know what my excuse is: i have this in prod .bash_profile: puppet() { sudo puppet "$@"; } [18:11:24] so it always adds the sudo [18:12:01] yeah I have sth similar in production, alias pat='sudo puppet agent --test' [18:12:30] I know it is going to be a disappointing experience running puppet so at least I'm pat-ting it [18:15:46] godog: I'm kidding I have that exact alias [18:16:46] chasemp: haha great minds think alike [18:19:11] chasemp: started one https://etherpad.wikimedia.org/p/labs-releng-k8s-confluence [18:19:20] thinking about it, my shortcut is RAGE, lol. 
"Ctrl+R age" autocompletes to "puppet agent -tv" from history [20:05:48] 06Labs, 10Labs-Infrastructure, 10DBA: LabsDB replica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3123104 (10jcrespo) [20:05:51] 06Labs, 10DBA: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3123103 (10jcrespo) 05Open>03Resolved [20:16:05] 06Labs, 10Tool-Labs, 10Wikidata: Add interstitial to wikidata-externalid-url - https://phabricator.wikimedia.org/T160205#3123114 (10ArthurPSmith) @Esc3300 well, I developed this tool because links for IMDB and a handful of other properties were broken when we made the change from string to "external identifi... [20:40:02] HELP [20:41:53] infobliss: ask your question [20:44:24] Hi, I am new to Tool labs. Was wondering how I can see the different tools and play with them. [20:46:34] the giant list of existing tools is at https://tools.wmflabs.org/?list -- There is a much prettier but shorter list at https://tools.wmflabs.org/hay/directory/ [20:47:01] infobliss: is there a particular type of tool you are interested in using? [20:47:42] May be the ones for uploading images into commons. [20:48:14] you could start looking at https://tools.wmflabs.org/hay/directory/?search=commons#/search/commons [20:48:49] https://tools.wmflabs.org/flickr2commons/ is a popular one I think [20:49:27] Oh great. Thanks bd808. [20:57:38] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Deprecate precise instances in Labs by 2017-03-31 - https://phabricator.wikimedia.org/T143349#3123245 (10hashar) @chasemp wrote: > thank you @hashar You are welcome. And thank you #labs team to have babysitted and pushed for that Precise phase out sprintĀ \o/ [21:01:19] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123251 (10Andrew) @hashar, it's nothing to do with load. 
there are no VMs running on labvirt1001 and it still has the problem. [21:15:16] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123283 (10Andrew) The one symptom I'm fixating on is puppet runs. A puppet run on labvirt1001 takes 811.99. The same run on labvirt1002 (which is actually doing useful t... [21:17:45] Hi infobliss, you are interested in https://phabricator.wikimedia.org/T138464? [21:20:03] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123314 (10Paladox) what about doing puppet agent -tv --debug --verbose to see what it is taking so long on? [21:24:20] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3080282 (10yuvipanda) @Paladox what kind of things should I be looking for when running `puppet agent -tv --debug --verbose`? [21:25:08] marxarelli: hey! I've an hour-ish right now if you wanna just chat on IRC about the auth stuff you're running into :) Sorry about short notice, been a hectic week. [21:25:10] ok if not too :) [21:25:40] yuvipanda: sure! [21:26:12] marxarelli: where is the master running? [21:26:15] so i've sort of built a frankenstein's monster k8s master and node [21:26:21] ci-staging project [21:26:25] nice [21:26:44] marxarelli: what instance? 
[21:26:46] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123323 (10Andrew) And post-reboot it's fast again dammit [21:26:54] ci-staging-k8s-master and ci-staging-k8s-node [21:26:57] er -node01 [21:27:18] ok [21:27:34] and the local puppet hacking is on ci-staging-puppetmaster02 [21:27:36] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123325 (10Paladox) @yuvipanda hi, For example running it on gerrit-test3 returns P5113 So maybe it will tell us what bit it gets stuck on the longest. It will go through... [21:27:41] i should really at least get that into a patch [21:28:17] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123326 (10yuvipanda) @Paladox thank you. Do you know how to get timing information out of it? [21:28:50] marxarelli: :D yes :) also can you add me to the project? [21:28:56] I'm on a new laptop that doesn't have root keys, only normal keys [21:28:59] but long story short on the auth stuff: i created my own tokenauth file on the master for an admin user and client-infrastructure user, and temporarily disabled authz entirely in the /etc/default/kubernetes [21:29:07] oh sure thing [21:29:20] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123328 (10Paladox) @yuvipanda puppet agent -tv --debug --verbose --evaltrace https://ask.puppet.com/question/2755/howto-trace-execution-time-of-components-of-agent-run/ [21:30:04] marxarelli: ok! what problems are you running into? 
[21:30:27] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123329 (10Paladox) It will show it like Debug: Finishing transaction 20116140 [21:31:04] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123330 (10Paladox) or doing "If you have reports=true in your puppet.conf on the agent, you can see the time spent on each resource type. Reports are stored on the agent i... [21:31:23] yuvipanda: k. you're an admin in ci-staging now [21:31:51] mwahahah all the power [21:31:54] wait I already had all the power :D [21:31:56] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3123343 (10Paladox) actually the command is puppet agent -tv --debug --verbose --evaltrace -td [21:32:00] i've been running into all kinds of problems, though they've mostly been on account of me learning k8s :) [21:32:17] but the most confusing thing so far has been trying to get the abac authz to work and failing miserably [21:32:33] marxarelli: ok! what are you trying to get it to do? [21:33:46] yuvipanda: we want to run a couple of experiments. 1) evaluate some crazy jenkins pipeline plugin by Pearson Technologies; 2) experiment with using k8s to provision Jenkins nodes using the k8s plugin for jenkins [21:34:24] * yuvipanda nods [21:34:34] so what kinda abac are you trying to do? 
:) [21:35:51] (I'll note generally that 1.6 should have complete RBAC, which is better in every way than ABAC) [21:36:04] nothing fancy, for now I just wanted to grant All The Things to an admin user and make sure the kubelet user had enough permissions [21:36:24] but i would settle for anything that works at this point, just so we can evaluate the jenkins plugin [21:36:47] and `--authorization-mode=AlwaysAllow` seems to work for now :) [21:37:06] but that seems like an awful workaround [21:37:11] marxarelli: ah, I see. so are you just trying to get puppet to do that for you? [21:37:12] obviously [21:37:28] marxarelli: that's actually a pretty decent workaround for now, I think. Especially if you don't have serviceaccounts... [21:37:40] i couldn't get abac to work even without puppet [21:37:51] i bumped up the loglevel to 2 and still didn't see anything useful in the logs about why the 403 was thrown [21:38:11] marxarelli: I usually set it to 10 when debugging [21:38:15] yuvipanda: yeah, maybe it's ok. i finally got the node added about 10 minutes ago :) [21:38:16] 2 doesn't do useful things I think [21:38:22] marxarelli: ah, nice. [21:38:26] had the wrong version of docker installed [21:38:30] marxarelli: prod does the same thing btw :) [21:38:44] (no authorization / authentication. waiting for RBAC) [21:38:52] ah! well, what's good for prod is good for me :) [21:38:55] so AlwaysAllow is totally ok for now. [21:39:00] nice. ok [21:39:18] marxarelli: yeah. there's a puppet param now for setting authz mode as well [21:39:43] yeah, saw that. seemed like it only supported abac or nothing though [21:40:41] yuvipanda: oh, another question! is cni needed on the nodes? i'm seeing "error updating cni config" in `journalctl -fu kubelet` because... well, i didn't include any of the cni modules [21:41:18] marxarelli: I think 'nothing' should give you what you want no? [21:41:32] marxarelli: depends. 
we don't use CNI in tools - we just use flannel integrating directly with docker. [21:41:52] marxarelli: while prod uses CNI with Calico - but you can't do that in labs because of some networking limitations of nova-network [21:42:02] marxarelli: so what you want is flannel and not CNI [21:42:09] oh, ok [21:42:44] so could you educate me a little on what they're used for. afaict, it's for vlans across container subnets. is that right? [21:43:16] like if you want processes in containers to be able to reach each other across pods/nodes? [21:46:41] yuvipanda: i see a couple of different flannel modules. is k8s::flannel right? [21:47:07] marxarelli: yeah, kindof. so kubernetes needs all containers to be able to talk to each other on a 'flat' network [21:47:19] so all containers can talk to all other containers [21:47:27] marxarelli: there are some foundational design docs from kubernetes that outline their assumptions for pod=>pod and pod=>service [21:47:27] https://github.com/kubernetes/community/blob/master/contributors/design-proposals/networking.md [21:47:46] ah, chasemp's link is probably an actual explanation of what I was gonna butcher :D [21:48:24] yuvipanda, chasemp: rad! thanks for the link [21:48:25] marxarelli: yeah, k8s::flannel is right, but it needs slightly intricate setup. look at toollabs/k8s/worker [21:48:48] marxarelli: I just redid the k8s master and worker role yesterday to reuse prod code too [21:53:48] yuvipanda: this is really great. i'll see if i can refactor my node manifest to use the same settings and be less kludgy [21:54:07] marxarelli: :D yeah, sorry i didn't do that earlier. This should make your life simpler [21:54:19] marxarelli: I'm planning on doing the same for the docker registry stuff this week / next [21:54:58] yuvipanda: awesome. the docker registry stuff seems a lot more straightforward to me, at least with the filesystem backend [21:55:10] marxarelli: :D nice! [21:55:12] cool! 
[21:57:40] yuvipanda: i think the hardest part about this has been that it's all new to me (etcd, k8s, flannel) and as great as puppet modules are for managing, they're not exactly great for making inferences about new systems :) [21:57:44] so i really appreciate the help! [21:58:57] marxarelli: I'm maybe slightly farther down the path for a bit of it and not for some so we can relish befuddlement together eh? [21:59:03] marxarelli: yeah totally [21:59:22] marxarelli: highly recommend https://github.com/kelseyhightower/kubernetes-the-hard-way btw [21:59:30] chasemp: for sure. i'm always up for mutual befuddlement! [21:59:57] marxarelli: super nice for building up a 'from the core' understanding of k8s [21:59:59] chasemp: madhuvishy ^ for y'all too, if / when you wanna read :) [22:00:18] yuvipanda: rad! [22:02:25] thcipriani, too ^ because he's only doing 1e6 - 1 things right now [22:03:20] marxarelli: now is the time to get his signature on some misleading document that signs his gold over to us [22:03:22] heh, I've got it starred on github already :) [22:03:36] marxarelli: :) madhuvishy informs me (since she's sitting next to me) that there's a Udacity course around it from Kelsey Hightower as well if y'all prefer that [22:05:06] yuvipanda, madhuvishy: "timeline: approx. 1 month" nice :) [22:05:08] so I did the udacity thing [22:05:15] for fun [22:05:21] https://www.udacity.com/course/scalable-microservices-with-kubernetes--ud615 [22:05:24] with google cloud free trial credits [22:06:02] chasemp: and now you're fitter, happier, and more productive? 
[22:06:18] I feel like 998 bucks [22:06:38] :D [22:06:43] it had some continuity issues but it's interesting, and kelsey is a badass [22:09:33] not much of a deep dive is my biggest complaint :) [22:15:46] right, it does seem introductory [22:25:37] !log wikilabels deploying 43d080c to staging [22:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL [22:31:19] Looks good, deploying to prod [22:31:34] 06Labs, 10Tool-Labs, 06translatewiki.net: update node.js on tools.telegrambot - https://phabricator.wikimedia.org/T159368#3123542 (10yuvipanda) a:05yuvipanda>03None [22:31:36] 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123543 (10yuvipanda) a:05yuvipanda>03None [22:31:40] 10PAWS, 06Research-and-Data-Backlog, 07Epic: Launch Open Notebooks Infrastructure - https://phabricator.wikimedia.org/T140430#3123546 (10yuvipanda) a:05yuvipanda>03None [22:31:42] 06Labs, 10Tool-Labs, 10Tools-Kubernetes: Setup Kubernetes Masters in a HA setup - https://phabricator.wikimedia.org/T142862#3123545 (10yuvipanda) a:05yuvipanda>03None [22:31:46] 06Labs, 10Labs-Sprint-100, 10Tool-Labs: Write documentation on new webservice code - https://phabricator.wikimedia.org/T132987#3123549 (10yuvipanda) a:05yuvipanda>03None [22:31:49] 10PAWS: Write 2-3 year vision for annual plan - https://phabricator.wikimedia.org/T131267#3123550 (10yuvipanda) a:05yuvipanda>03None [22:31:51] 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123552 (10yuvipanda) This was done, and @madhuvishy just made it work... 
[22:32:04] 06Labs, 10Labs-Infrastructure, 10DBA, 13Patch-For-Review: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#3123554 (10yuvipanda) a:03madhuvishy [22:33:50] 06Labs, 10Tool-Labs, 13Patch-For-Review, 07Tracking: Make maintain-dbusers.py create replica.my.cnf files for user accounts as well - https://phabricator.wikimedia.org/T158420#3123556 (10madhuvishy) This is done now, all tools users should have replica.my.cnfs in their home dirs now and ongoing. [22:34:27] !log wikilabels deploying 43d080c to prod [22:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL [22:58:21] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161107#3123690 (10Reedy) 05Open>03Resolved a:03Reedy I've disabled the cronjob for now per @dpatrick's request on IRC till there is time to actually allocate to this project again [23:00:53] 06Labs, 07Security-Core: scanner00.security-tools.eqiad.wmflabs has 4 CPU at 100% usage - https://phabricator.wikimedia.org/T161107#3123712 (10hashar) Thank you both! [23:05:47] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:40:49] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]