[07:06:08] Analytics, Analytics-Cluster, Patch-For-Review: Kafka mirror maker failures when kafka brokers are restarted - https://phabricator.wikimedia.org/T157705#3532914 (elukey) Open>Resolved a:elukey >>! In T157705#3352048, @Nuria wrote: > As part of kafka upgrade mirrormaker will get a revamp...
[07:06:31] Analytics, Operations, User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#3532917 (elukey)
[07:06:40] Analytics, Operations, User-Elukey: Tune Kafka logs to register clients connected - https://phabricator.wikimedia.org/T173493#3532918 (elukey)
[11:25:11] * elukey lunch!
[12:49:30] (CR) Elukey: [C: -1] "I don't believe that this is a good approach, because it will force the same perms to all the users running report updater." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/371955 (https://phabricator.wikimedia.org/T173333) (owner: Bearloga)
[12:55:01] I am wondering where are the discovery reportupdater jobs in puppet
[12:57:57] ah /srv/discovery/golden/main.sh
[12:59:06] bearloga: o/ - If you guys start all the report updater stuff via the above script, you might just add an appropriate umask to the script and this should do the trick
[12:59:10] discovery-stats@stat1005:/srv/published-datasets$ umask -S
[12:59:12] u=rwx,g=rx,o=rx
[12:59:30] so by default files are created with rx perms for the target group
[12:59:43] I would use umask rather than hardcoding perms to report updater
[13:01:16] Analytics-Kanban, Discovery, Discovery-Analysis, Patch-For-Review: Reportupdater outputs files with restricted permissions - https://phabricator.wikimedia.org/T173333#3533380 (elukey) p:Unbreak!>Normal
[13:01:48] Analytics-Kanban, Discovery, Discovery-Analysis, Patch-For-Review: Reportupdater outputs files with restricted permissions - https://phabricator.wikimedia.org/T173333#3524418 (elukey) Added a couple of notes on IRC: ``` 14:55 I am wondering where are the discovery reportupdater jobs in...
[13:49:22] Analytics-Kanban, User-Elukey: Archive PageContentSaveComplete in hdfs while we continue collecting data - https://phabricator.wikimedia.org/T170720#3533452 (elukey) We noticed only a modest 100GB drop in disk usage on slave/store after the drop of the table. It seems that what reported by the following...
[13:55:32] Analytics-Kanban, User-Elukey: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3533459 (elukey) We dropped the `PageContentSaveComplete` table and re-gained only ~100GB, that is not what we expected. I checked some numbers on the databases and reported...
[14:33:10] elukey, bearloga, I think the problem is not only the published_datasets/discovery, but also the .rerun folders inside reportupdater, and also the log files output by reportupdater
[14:34:57] also elukey, the discovery script (and by extension reportupdater) should support being executed by both discovery-stats user and any user in the wikidev group
[14:35:29] so that they can mark reports for rerun
[14:49:30] (PS4) Mforns: Give group write permission to output files [analytics/reportupdater] - https://gerrit.wikimedia.org/r/371955 (https://phabricator.wikimedia.org/T173333) (owner: Bearloga)
[14:55:00] (CR) Mforns: "I added a couple changes to also add permits for the .reruns folder and other files. After Guillaume's comments in the thread I would be t" (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/371955 (https://phabricator.wikimedia.org/T173333) (owner: Bearloga)
[15:08:39] gehel: o/ - do you have a minute for --^
[15:09:00] sure
[15:09:11] I could be wrong but isn't umask enough for the report updater files created?
[15:09:48] (added some comments in the task/codereview)
[15:11:01] yeah, umask seems to make sense.
[15:11:17] in main.sh, we could try that one first
[15:11:24] I would argue that there is a deeper issue where that script creates files in the same directory as the sources, and that makes me slightly nervous
[15:11:58] and as far as I can see, it does not address the issue of logs not being group readable
[15:13:21] umask would probably solve the issue in a more generic way, but would still keep writing files in the same directory as the sources (which might or might not be a major issue)
[15:13:28] I'll add that to the CR
[15:15:19] (CR) Gehel: "I quite like the suggestion of elukey to set umask instead of changing the script. This change actually does not address the fact that the" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/371955 (https://phabricator.wikimedia.org/T173333) (owner: Bearloga)
[15:30:26] Hi robh
[15:30:35] heyas, im merging your access live now
[15:30:55] well, manually running on bast1001.wikimedia.org and stat1003.eqiad.wmnet at least
[15:31:04] and watched it put your ssh key back into place on both.
[15:31:06] (just this second)
[15:32:45] ok, before doing any other mistake, let me ask a silly question: the key that I sent through the phabricator, is for the stats machines, right?
[15:49:00] addshore: o/ - do you have a minute?
[15:49:08] dsaez: sorry, went afk to make food
[15:49:17] so the key in phabricator is now for your production shell access
[15:49:26] this means all bastions, plus stat and if you get rights to anything else, those
[15:49:36] aka: everything not in cloud/tool/labs
[15:49:44] its why it had to be a different key
[15:50:08] ok
[15:50:08] since the labs environment has potential compromise vectors of keys (since its shared systems with root on the instances being far far more easily obtained)
[15:50:10] =]
[15:50:52] not a silly question =] confirmed though the latest key in phabricator task comment is what i put into the production (stat) machines
[15:52:55] elukey: ohia
[15:53:24] ok, but it still asking for password :S
[15:54:01] I get this error: debug1: key_load_public: No such file or directory but the file with the key is there ...any guess?
[15:54:38] addshore: o/ - I created https://gerrit.wikimedia.org/r/#/c/372540/1/modules/statistics/manifests/rsync/mediawiki.pp to rsync the WMDE.logs to stat1005, just wanted to ask you if they contained super sensitive data but it seems ok
[15:54:46] elukey: I'm fine with your patch if thats the way you want to go :)
[15:54:59] yeh, not any crazy super sensitive stuff
[15:55:07] super, merging :)
[15:55:35] dsaez: ok, so some of this is going to repeat yesterday, but we changed things so bear with me
[15:55:45] ill just run down my mental checklist of things to look at
[15:56:16] Can you go ahead and pastebin your current (from today's changes) ssh config for review on https://phabricator.wikimedia.org/paste/ ?
[15:56:27] that way it'll not do funky format or anything
[15:56:31] just a cat of the file is good =]
[15:57:16] the other option is just wipe out your entire config of ALL settings, and try to manually load your new key and ssh into bast1001.wikimedia.org (you wont be able to ssh to stat machines with an empty ssh config.)
[15:57:41] basically we want to eliminate your config as a potential issue first, since its often the cause of grief for many.
[15:58:46] addshore: done! Going to update the task
[15:59:31] robh: so, I wipe the ~/.ssh/config file and ssh -vvv diego@bast1001.wikimedia.org , is that right?
[15:59:58] wipe the file, ensure your new production ssh key is loaded (ssh-add -L will show if it is) and then ssh -vvv diego@bast1001.wikimedia.org yep!
[16:00:13] that way we'll see if you can simply hit the bastion with no other config potential shenanigans
[16:00:43] Also, just so I have some more familiarity, are you using os x, windows, or linux?
[16:00:44] thanks elukey
[16:00:53] linux
[16:01:03] aka: the most normal of them regarding ssh
[16:01:03] hehe
[16:01:11] part of me really worried you were going to answer windows =P
[16:01:16] come on!
[16:01:17] addshore: the logs will be kept for 90d, if you need more ask to otto :)
[16:01:23] ack!
[16:01:34] dsaez: not meant as insult, just i expect the worst !
[16:01:39] heh
[16:02:01] first linux in my computer was Debian 1.0.0 in 1995 ;)
[16:02:27] I was just recalling that I had to help a volunteer a large number of years ago with ssh access to the cluster, and after 2 days it turned out to be some kind of issue with the terminal program (putty) that he was using having a mismatch for the encryption of the session
[16:02:30] due to it being ancient.
[16:02:36] going off people, see you in a week!
[16:02:39] so i just wanted to eliminate that ;]
[16:03:09] ok, system ask for password
[16:03:13] log here: https://phabricator.wikimedia.org/P5894
[16:04:26] reviewing
[16:05:32] point it is, I don't have any key associated to the bastion machines
[16:05:33] im basically comparing your output to my own
[16:05:45] so when you get cluster access, bastion is automatic
[16:05:57] or else you cannot access things in the private vlan, so i can see on bast1001 you have an account and your user key is there.
[16:06:07] let me toss my own output in a pastebin so you can see them both as well
[16:06:12] ok
[16:06:26] actually, wiping my own ssh config for identical stuff
[16:06:56] ok, I might change the key on the bastion too, because I don't know where is that key coming from
[16:07:26] https://phabricator.wikimedia.org/P5895
[16:07:30] ?
[16:07:41] i dont get what you mean when you say you may change the key?
[16:08:45] i can see on bast1001 it has your new ssh key you provided today =]
[16:09:01] root@bast1001:/etc/ssh/userkeys# cat diego
[16:09:01] ssh rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/oA9hk3snx7Y66ZR3sEMukK6tOk4esFT02irhqB0jr9osstyZN9KhPVeWMzhip+93ToDzX+aDHeXqsu5grXsqGQZBZSU850GSNt0pgi8+4E1yGQngLNhFv+z7EemWUQH0XT4atoDXbmfXdRv6NpVlcr1vAPBQpjBZuFe5vaxLKRrhZpm+YNvl4RHdGbZorz6WI0NvzuOTerKUOyUZS/KQpT4FxlvVRoTIO3H05EtJEv3745rUH5wHCcyr7m9Hdsjh8RMrWF3okKLk9WOxQTesfvvstmu8GpBbauzmQYMKwRcKQqoc0/qo3ZGvwXPwrYjv4wpPZhDexjOfUPLGvtGf dsaeztrumper@dsaeztrumperroot@bast1001:/et
[16:09:02] c/ssh/userkeys#
[16:09:10] meh, format is borked, but its the right key
[16:09:15] ok
[16:09:31] that was what I was asking before, if that was for bastion or prod
[16:11:24] ahh, bastion = production host
[16:11:56] so yeah, i can see in my output and yours start to diverge on lines 109 and 93
[16:12:01] (your output and mine respectively)
[16:12:12] i offer my pub key, and its accepted, for some reason yours is not.
[16:12:50] your output shows debug3: receive packet: type 51
[16:14:06] im still trying to figure it out, but ive not seen this that i recall
[16:14:18] ie: we may have to drag in some others with a bit more in depth experience.
[16:15:08] google says this is typically caused by a bad character in the private key file
[16:15:10] should i delete all my .ssh folder and start everything again?
[16:15:14] which seems unlikely given you've made it twice
[16:15:21] hmm
[16:15:31] I did copy/paste
[16:15:36] oh?
[16:15:45] yeah, then you could have introduced
[16:15:57] i assumed you used ssh-keygen to make the file in place, not a copy paste =]
[16:16:01] damned assumptions!
[16:16:07] no no
[16:16:11] i did
[16:16:13] what i mean
[16:16:24] oh, no, a bad character in the private key file, not the pub key string.
[16:16:25] =]
[16:16:25] to copy on phabricator I did copy/paste
[16:16:39] the pub key seems legit to me when i compare to other pub keys in the files
[16:16:44] its formatting looks the same.
[16:16:48] yep
[16:17:44] ohhh
[16:17:50] does your user have permissions to access the private key?
[16:18:02] (another google result for the type 51 output
[16:18:03] )
[16:18:57] dsaez: im watching the auth log
[16:19:06] can you attempt a login to bast1001 as yourself now while im watching?
[16:19:17] ok
[16:19:36] doing now
[16:19:46] ok, got the output
[16:19:54] Aug 18 16:19:31 bast1001 sshd[28175]: Failed publickey for diego from 93.176.156.65 port 8086 ssh2: RSA a1:f0:59:17:5d:81:61:38:12:2d:57:b9:ab:66:1d:85
[16:20:26] Aug 18 16:19:31 bast1001 sshd[28175]: Bad options in /etc/ssh/userkeys/diego file, line 1:
[16:20:45] are we sure that my username is diego and not dsaez?
[16:21:01] absolutely =]
[16:21:06] i can see your user diego files
[16:21:55] and i can see your entry in passwd
[16:22:07] but it dislikes your key matching
[16:22:21] im not 100% i should have pasted that line in here, im fairly certain it has nothing bad
[16:23:31] ahhhh
[16:23:40] dsaez: found it!
[16:23:48] and i suspect it may be my fault...
[16:23:56] ssh rsa versus ssh-rsa
[16:23:59] =P
[16:24:11] ! :D
[16:24:26] im going to live hack it on bastion
[16:24:30] and we try, if it works i fix right
[16:25:38] dsaez: try to login now?
[16:25:41] just to bast1001
[16:25:48] yes!!
[16:26:01] ok, then it was totally that mistake
[16:26:06] give me a few minutes to fix and roll it out
[16:26:20] and im really sorry that my typo caused you grief ;_;
[16:26:28] great, thx
[16:26:50] I'm still not understanding the difference between the bastion / stats / production machines ...
[16:26:57] and the respective keys...
[16:27:54] so the basic is this, we have 3 realms in Wikimedia server in terms of security
[16:28:29] cloud (labs/labs/labs), production (stat, bastion, mostly everything), and fundraising (its own little silo)
[16:28:43] you need one key for production (all of bastions, analytics, stat machines)
[16:28:54] and one key for cloud (all of labs, cloud, toollabs, etc)
[16:29:10] then in terms of bastion / stats / production
[16:29:21] some production systems have public ips that you can route to with nothing special
[16:29:26] (like bast1001.wikimedia.org)
[16:29:31] I see
[16:29:34] and some have internal vlans, that you have to setup your ssh config to proxy through
[16:29:41] like stat1003.eqiad.wmnet
[16:29:49] thats not a FQDN in terms of the internet, heh
[16:29:56] ok, so I was not using same keys for labs and prod, because I'm not using the labs (yet)
[16:30:14] ahh, so there was some confusion there, oh well
[16:30:16] I was confused between labs and bastion
[16:31:54] at some point we'll have this on a nice graphic but ive been saying that for a long time now, heh
[16:32:08] :)
[16:33:14] ok, puppet has run on bast1001 and stat1003
[16:33:18] you should restore your ssh config
[16:33:21] and try to login to both again =]
[16:33:27] ok
[16:33:52] I'm in
[16:33:54] !
[16:34:05] ssh-rsa not ssh rsa =P i joked yesterday to another opsen that i felt badly for you since it seemed your access was cursed. it doesnt help when im doing the breaking ;D
[16:34:46] i dont really want to check my patchset history, but i have a sneaking suspicion this is what caused the issue yesterday as well....
[16:34:53] but thats the past, moving along ;D
[16:35:01] (srsly, sorry about that!)
[16:35:02] hehe..cool
[16:35:11] no worries! thank you very much
[16:35:25] Let me know if you have any other issues =]
[16:35:33] btw, should I also configure access to the labs machines? (do they have access to another information?)
[16:37:52] labs is pretty much for testing things and experimentation
[16:38:07] im not sure exactly what your job is, so im not sure if you need labs =]
[16:38:17] but if its private data, and it seems to be, labs is not ok for that
[16:38:31] rephrase: you couldnt put anything private in labs, since its not secure and folks without NDA have access
[16:39:57] got it
[16:40:14] labs is what used to be tools-lab servers?
[16:40:28] so we're rebranding due to this confusion
[16:40:32] aka: the labs labs labs problem
[16:40:52] ok
[16:40:52] https://wikitech.wikimedia.org/wiki/Labs_labs_labs
[16:40:55] heh
[16:41:02] it even has a goddamn wikitech page ;D
[16:41:24] Wikimedia Cloud VPS (formerly Wikimedia Labs), a cloud computing infrastructure (maintained by Wikimedia Operations).
[16:41:27] Toolforge (formerly Tool Labs), a platform for web services and bots run by volunteers (maintained by Wikimedia Operations).
[16:41:29] Beta Cluster, a small wiki farm running the latest alpha version of MediaWiki. Sometimes improperly called "Beta labs" (maintained by Wikimedia Release Engineering)
[16:41:49] hence the badly needed rebrand/rename that is currently ongoing.
[17:32:19] Analytics: Troubleshoot Wikimetrics "magic button" - https://phabricator.wikimedia.org/T173585#3533876 (mforns)
[18:07:43] robh: just one more thing... I can access stats1003, but not 1005
[18:08:46] huh, should have access according to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups
[18:08:50] lemme login and fire puppet and see whats up
[18:11:44] yep, I'm supposed to be in the analytics-privatedata-users group that maps to stats1005
[18:12:20] sorry, got a phone call from a vendor
[18:12:20] back now
[18:12:44] hrmm, ok your userkey is in place on stat1005
[18:12:52] let me tail the auth log
[18:12:55] and then you can try again
[18:13:10] dsaez: ok, try logging into it again?
[18:14:32] let me know when you do in case i dont see it
[18:14:34] so far i dont
[18:14:58] I'm in
[18:15:02] great
[18:15:04] thx
[18:15:05] oh, well, yay
[18:15:07] i did nothing
[18:15:13] just looked at it sternly ;D
[18:15:32] hahaha
[18:15:32] it could have simply fired off an automated puppet run from when you tried and failed
[18:15:36] to when i logged in and checked
[18:15:43] it runs every 30m or so
[18:15:58] who knows =P
[18:16:29] hehe..mysteries of life ;)
[18:16:57] thank you very much.
[18:17:10] I think I have everything setup now
[18:18:52] I just need to decide which irc client I'll be using. I'm collecting opinions. Which is your favorite?
[18:33:12] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[19:10:57] dsaez: i use znc as a bouncer
[19:11:08] and then connect with a desktop client cuz i dislike most irc in command line
[19:11:32] irc clients and bouncers tends to have fairly strong viewpoints in wmf compared to elsewhere, hehe
[19:11:49] i have been mocked by irssi users saying that its irssi or death.
[19:11:54] hehe
[19:12:14] however, we dont actually run any kind of WMF aggregation or bouncing service, so its wholly individually run
[19:12:22] though many wmf folks used some kind of irccloud thing
[19:12:34] i am too paranoid for centralization of my chat client with everyone else in the org
[19:13:10] im also lazy and using os x so my desktop client list is wildly different than yours (i use limechat though since its open source)
[19:13:41] znc has just simply worked, and im happy to give you a copy of my scrubbed config for it
[19:13:47] so you dont have to think about it much ;]
[19:13:59] though you need a server somewhere to let it sit and run
[19:14:53] (wikimedia cloud vps isnt ok to use for irc stuff last i checked, so cannot just run a bouncer there)
[19:19:05] cool, thx, I'll keep an eye on this
[19:19:12] analytics1055 alarms is an expired downtime, just fixed it :)
[20:11:11] Analytics, Analytics-EventLogging, Performance-Team, Scap (Scap3-Adoption-Phase1): Use scap3 to deploy eventlogging/eventlogging - https://phabricator.wikimedia.org/T118772#3534263 (Ottomata) Not really. You get a little bit of niceness in EventLogging because each event is an Event instance (a...
[21:32:01] Analytics: Reportupdater: do not write execution control files in source directories - https://phabricator.wikimedia.org/T173604#3534404 (mforns)
[21:32:37] Analytics-Kanban, Discovery, Discovery-Analysis, Patch-For-Review: Reportupdater outputs files with restricted permissions - https://phabricator.wikimedia.org/T173333#3534417 (mforns) Hi all! There is one gzipped log that has +r for all users: `/srv/discovery/log/golden-daily.log-20170806.gz`. Lo...
[21:38:13] Analytics, Operations, Ops-Access-Requests, Research, Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3534437 (Halfak)
[21:40:30] Analytics, Operations, Ops-Access-Requests, Research, Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3534441 (Halfak)