[00:24:37] Wikimedia-Labs-wikistats: [Errors 992] Failed data retrieval shouldn't update "Last update" column - https://phabricator.wikimedia.org/T46145#1002867 (Dzahn) p:Normal>Low [01:02:03] Tool-Labs: -once is not checked correctly by jstart randomly - https://phabricator.wikimedia.org/T60145#1002983 (Krinkle) May be related to {T62862}, which is still happening frequently. [01:11:54] I heard labs blew up again today. [01:12:13] Do I need to restart the tools I maintain? [01:13:15] Channel topic isn't very helpful since it hasn't been updated since yesterday's restarts. [01:35:53] T13|detached: I updated the topic. [01:36:36] Using the API (preferably) or the database, how do I get new files created [01:36:37] ? [01:36:55] T13|detached, labs is dying is more like it. Until Coren gets back the NFS is deteriorating and the web grid is overloading. [01:36:56] Magog_the_Ogre: you mean uploading files to a wiki? action=upload [01:37:15] er, no legoktm [01:37:43] getting the files uploaded by date, as in a list of them [01:37:52] ah [01:37:53] one sec, [01:38:16] currently my bot is looking at the log and then following the redirects, but that's imprecise because the redirects can be vandalized, deleted, etc. [01:39:37] https://commons.wikimedia.org/w/api.php?action=query&list=recentchanges&rcnamespace=6&rctype=new ?
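For the question about listing files by upload date, here is a minimal Python sketch that builds the `list=allimages` query quoted in the log just below. The `ailimit` and `format` parameters are assumptions added for illustration and are not part of the quoted URL; the function name is hypothetical.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def new_uploads_url(limit: int = 10) -> str:
    """Build an allimages query sorted by upload timestamp."""
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "user|timestamp|url",
        "aisort": "timestamp",
        "aidir": "newer",
        "ailimit": limit,   # assumption: not in the URL from the log
        "format": "json",   # assumption: not in the URL from the log
    }
    # urlencode percent-encodes the "|" separators as %7C.
    return f"{API}?{urlencode(params)}"


print(new_uploads_url())
```

Building the query from a dict keeps the pipe-separated `aiprop` values readable and leaves the escaping to the standard library.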
[01:40:27] oh I found it already [01:40:27] https://commons.wikimedia.org/w/api.php?action=query&list=allimages&aiprop=user|timestamp|url&aisort=timestamp&aidir=newer [01:40:43] google is an amazing tool that works at least 40% of the time [01:41:11] legoktm, recentchanges isn't it because I don't think it captures uploads, rather just new pages created apart from uploads [01:42:39] right [01:43:54] thanks though [02:54:33] PROBLEM - Free space - all mounts on tools-webgrid-02 is CRITICAL: CRITICAL: tools.tools-webgrid-02.diskspace._var.byte_percentfree.value (<22.22%) [03:27:00] Wikibugs: Repond to "CTCP SOURCE" and "help" commands - https://phabricator.wikimedia.org/T88070#1003489 (awight) NEW [03:36:43] (PS1) Awight: Send fundraising stuff to our channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/187647 (https://phabricator.wikimedia.org/T88071) [04:26:20] (CR) Legoktm: [C: 2] Send fundraising stuff to our channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/187647 (https://phabricator.wikimedia.org/T88071) (owner: Awight) [04:26:39] (Merged) jenkins-bot: Send fundraising stuff to our channel [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/187647 (https://phabricator.wikimedia.org/T88071) (owner: Awight) [04:28:51] !log tools.wikibugs Updated channels.yaml to: 4fe2e5b9f9d699d3547aba5b320fdf9ce1bd96b0 Send fundraising stuff to our channel [04:28:57] Logged the message, Master [06:30:46] Tool-Labs: Add wiki title case sensitivity flag (is_sensitive) to meta_p.wiki to support jbo.wp and wiktionary tools - https://phabricator.wikimedia.org/T69476#1003763 (jayvdb) [08:12:21] Wikimedia-Labs-wikitech-interface, Engineering-Community: Wikitech registration requires labs shell access - https://phabricator.wikimedia.org/T88092#1003882 (Tgr) NEW [08:12:53] Wikimedia-Labs-wikitech-interface, Engineering-Community: Wikitech registration requires labs shell access -
https://phabricator.wikimedia.org/T88092#1003889 (Tgr) [08:13:57] Wikimedia-Labs-wikitech-interface, Engineering-Community: Wikitech registration requires labs shell access - https://phabricator.wikimedia.org/T88092#1003882 (Tgr) Also the registration form is enclosed in ``/`` - don't know if that's intentional or a bug, but looks weird. [08:41:57] PROBLEM - Free space - all mounts on tools-exec-13 is CRITICAL: CRITICAL: tools.tools-exec-13.diskspace._var.byte_percentfree.value (<55.56%) [08:45:36] Could someone tell me if the job runner is running? I'm trying to do a test upload with the GWT on Beta-Commons and it doesn't seem to be going through... [08:46:16] tgr|away, ^ [08:48:44] hi wittylama [08:49:37] wittylama, https://phabricator.wikimedia.org/P244 [08:51:53] Krenair - so, that means the job was received/submitted successfully but the job runner is not processing? [08:53:26] seems like it [08:55:24] okay, well there's a job runner VM actually running. [08:55:52] It doesn't seem particularly busy... [08:57:52] https://graphite.wmflabs.org/render/?title=deployment-jobrunner01+CPU+last+month&width=400&height=250&from=-1month&hideLegend=false&uniqueLegend=true&target=alias%28color%28stacked%28deployment-prep.deployment-jobrunner01.cpu.total.user.value%29%2C%22%233333bb%22%29%2C%22User%22%29&target=alias%28color%28stacked%28deployment-prep.deployment-jobrunner01.cpu.total.nice.value%29%2C%22%23ffea00%22%29%2C%22Nice%22%29&target=alias%28color%28stacked%28d [08:57:52] eployment-prep.deployment-jobrunner01.cpu.total.system.value%29%2C%22%23dd0000%22%29%2C%22System%22%29&target=alias%28color%28stacked%28deployment-prep.deployment-jobrunner01.cpu.total.iowait.value%29%2C%22%23ff8a60%22%29%2C%22Wait+I%2FO%22%29&target=alias%28alpha%28color%28stacked%28deployment-prep.deployment-jobrunner01.cpu.total.idle.value%29%2C%22%23e2e2f2%22%29%2C0.4%29%2C%22Idle%22%29 looks suspicious.
[08:58:07] oh wow, big url [08:58:14] gj graphite [09:01:52] krenair@deployment-jobrunner01:~$ service jobrunner status [09:01:52] jobrunner stop/waiting [09:01:54] wat. [09:01:55] wittylama, ^ [09:02:46] won't let me start it either. hmm [09:03:51] I wonder if https://gerrit.wikimedia.org/r/#/c/185939/ is to blame [09:03:55] YuviPanda|flight, ^ [09:17:06] wittylama, do you have a phabricator account? [09:19:13] wittylama, filed https://phabricator.wikimedia.org/T88094 anyway [09:20:03] Krenair - I do now. [09:20:10] I need to get some sleep because I have an important meeting in the office tomorrow for my actual job [09:20:15] local time is 01:20 [09:20:18] PROBLEM - Puppet staleness on tools-exec-07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [09:23:38] Krenair - thanks for your help. [09:29:45] andrewbogott: considering Coren did not implement any of my suggestions back then, and the fact that Ganglia got broken again only a few weeks after I fixed it... I'm not very motivated to help out [09:29:46] andrewbogott: page Coren, I'd say. [09:41:58] <`fox`> hey, I have issues with my tools hosted on wikimedia labs. The webservice does not start [09:42:07] <`fox`> example: https://tools.wmflabs.org/wikitrip/api.php [09:42:17] <`fox`> legoktm, ^ [09:42:38] <`fox`> it says there is no webservice even though i restarted it [09:50:07] `fox`: all normal webservers are overloaded because of nfs problems. 
that's why no new webservers are currently started (except for release=trusty) [09:59:14] <`fox`> thank you Merlissimo [09:59:30] <`fox`> in fact also a normal "ls" is pretty slow [09:59:37] <`fox`> i hope it will be fixed soon [10:00:46] there seems to be only one admin who is able to fix this problem for sure, but he is ill [10:07:59] <`fox`> lol [12:28:45] Labs: Puppet logs should be timestamped in a human-readable way - https://phabricator.wikimedia.org/T88108#1004161 (scfc) NEW [12:42:56] Tool-Labs, Labs: Fix Labs' PAM config mess - https://phabricator.wikimedia.org/T85910#1004185 (scfc) I agree with @coren on the global variable bit. In Tools we exclude non-admins from "infrastructure" hosts more out of precaution, but for example for the proxy host this is mandatory to lock down the local R... [12:44:26] Merlissimo: bus factor :( [12:44:52] although I'd expect there to be more labs admins who can fix the nfs servers [12:44:59] but everybody is probably traveling [12:46:51] valhallasw`cloud: i wrote "fix for sure". This doesn't exclude other admins with a "trial and error" solution [12:47:18] which is the normal way for computer scientists [12:47:21] :D [12:50:28] the source is andrewbogott who wrote "I don't want to run the risk of it not coming back up."
on the mailing list [13:31:41] Wikimedia-Labs-wikitech-interface, Labs: Wikitech registration requires labs shell access - https://phabricator.wikimedia.org/T88092#1004229 (chasemp) [13:41:49] PROBLEM - Puppet staleness on tools-submit is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [15:18:53] Could someone apply my public key, please? [15:22:26] https://wikitech.wikimedia.org/wiki/Shell_Request/The_Photographer [15:22:29] is not done [15:28:05] The_Photographer: you should do that yourself, under https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack [15:29:22] The_Photographer: you have shell rights: https://wikitech.wikimedia.org/w/index.php?title=Special%3AListUsers&username=The+Photographer&group=&limit=50 [15:38:14] valhallasw`cloud: ok, thanks [15:40:41] hi, I am having issues with connecting to bastion.wmflabs.org via ssh -- I tried the help pages, but found no solution -- can anyone help? [15:45:50] no problem, will try other ways [16:31:55] RECOVERY - Free space - all mounts on tools-exec-13 is OK: OK: All targets OK [17:02:59] when I execute a python file from labs it returns an error: "not an executable file" [17:03:06] how can I execute a python file? [17:03:33] either chmod +x file.py or 'python /path/to/file.py' [17:03:44] python /path/to/file.py [17:04:02] well, I cd to the path and after I run the .py file [17:04:53] what's a chmod? [17:06:14] ah, ok [17:06:16] I see XD [17:06:27] so, how can I execute a .py file? [17:16:24] Anyone use socketIO_client package?
I see it isn't installed, and if I install it in my directory in labs, it fails due to "from six.moves.urllib.parse import urlparse" [17:22:09] Tool-Labs: bigbrother only watches users jobs if they already have a job running - https://phabricator.wikimedia.org/T88122#1004389 (scfc) NEW [17:24:15] Labs: Create xtools project on Labs with domain xtools.wmflabs.org - https://phabricator.wikimedia.org/T88123#1004399 (Cyberpower678) NEW [17:24:24] Labs: create ORES project in Labs - https://phabricator.wikimedia.org/T87494#1004406 (Halfak) We already have a project called 'revscoring'. See https://wikitech.wikimedia.org/wiki/Nova_Resource:Revscoring [17:42:17] Hi! I'm trying to connect to be-x-oldwiki.labsdb with mysqli in php, but it fails and I get this: Warning: mysqli::mysqli(): (HY000/2005): Unknown MySQL server host 'be-x-oldwiki.labsdb' (0) [17:42:19] The script works with e.g. enwiki.labsdb or dewiki.labsdb etc. Anything special for be-x-old.wikipedia.org? [18:06:42] Anyone know how to execute a python file from labs? [18:07:23] NeoMahler: can you be more specific? What are you currently seeing? [18:07:31] And, by ‘labs’ do you mean ‘tools’ or are you someplace else? [18:07:36] yes yes [18:07:38] tools [18:07:54] You want to just run it as a one-off or is it a persistent thing like a bot or a server or something? [18:08:10] when I try to run it, I get "not an executable file" on filer .err [18:08:13] is an IRC bot [18:08:15] persistent [18:08:38] ok. to run a python file once, you can just type ‘python ’ [18:08:42] or modify the file so it’s executable. [18:08:52] But you shouldn’t run persistent things on tools-login; that should be submitted to the grid. [18:08:59] oh [18:09:05] I don't know what's the grid... [18:09:27] tools uses a grid engine that distributes jobs among a variety of hosts. Load-balancing and such. [18:09:40] ah [18:09:41] It’s ok to test your script locally until you see it working properly.
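The "modify the file so it's executable" advice above comes down to setting the execute bits on the script. Below is a minimal Python sketch of that step, roughly what `chmod +x` does. The file name `bot.py` and both helper names are hypothetical, and the sketch runs in a temporary directory rather than on a real tool account.

```python
import os
import stat
import tempfile


def has_exec_bit(path: str) -> bool:
    """True if the owner execute bit is set on the file."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def make_executable(path: str) -> None:
    """Add execute permission for owner, group and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "bot.py")  # hypothetical file name
    with open(script, "w") as fh:
        # The shebang line tells the system which interpreter to use
        # when the file is run directly.
        fh.write("#!/usr/bin/env python\nprint('hello')\n")
    before = has_exec_bit(script)
    make_executable(script)
    after = has_exec_bit(script)
    print(before, after)
```

Running the script as `python /path/to/file.py` sidesteps the execute bit entirely, which is why both suggestions in the log work.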
[18:09:50] ok [18:10:01] Do you have an actual tool group set up, or are you just logged in as yourself at the moment? [18:10:10] I have a group [18:10:14] cool [18:10:18] Here are docs about the grid: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Grid [18:10:18] well, I'm alone on the group, but... XD [18:10:23] ok [18:10:26] thanks! [18:10:28] But really it should be as simple as ‘jsub ' [18:10:42] jsub python file.py ? [18:10:56] Approximately; I’m not sure about quoting :) [18:11:01] ok XD [18:11:04] oops [18:11:19] When I run it without jsub, it gives me a traceback... [18:11:23] "permission denied" [18:11:28] Also be warned that that means your script will run on an arbitrary host. So any file IO should happen on a volume that’s shared amongst the hosts. Like /data/project/ [18:11:36] ok [18:11:46] permission denied could be… anything :) [18:11:51] NeoMahler: My command is: [18:11:54] jsub -once -N csd_report python /data/project/betacommand-dev/svn_copy/sql_csd.py [18:12:10] ok [18:12:42] andrewbogott, the config file of the bot is in another file... it seems that it can't open it [18:12:45] it generates http://tools.wmflabs.org/betacommand-dev/reports/CATCSD.html [18:13:13] NeoMahler: ok… probably you need to ensure that that config file is owned by the tool. [18:13:22] That would be something like ‘chown ' [18:13:29] ok [18:13:52] invalid user!! [18:14:43] Hm… all the files used by the tools.morebots bot are owned by ‘tools.morebots’ [18:14:59] so, pretty sure my advice is right. What command did you use? [18:15:12] chown [18:15:25] NeoMahler: where is the config file stored? [18:15:25] the full commandline? [18:15:50] andrewbogott, chown rc-vikidia default.py [18:16:06] NeoMahler: what does ‘whoami’ say? [18:16:11] Betacommand, in .../project/rc-vikidia/public_html [18:16:20] tools.rc-vikidia [18:16:23] ah! [18:16:25] so, there you go [18:16:33] I have to put chown tools.rc-vikidia? [18:16:39] you should chgrp as well.
So that the file’s owner and group are both tools.rc-vikidia [18:16:40] yep [18:17:16] NeoMahler: do you understand vaguely what this is doing? [18:17:25] File and group ownership, etc? [18:17:51] yes [18:18:00] mmhhh.. "chown: cannot access `config/default.py': Permission denied" [18:18:39] hm… try ‘take ’ [18:18:43] not sure if that will work, it might [18:19:11] oh! [18:19:28] I can't access the config folder... [18:19:31] with cd command [18:19:45] ok, so you probably need to be whatever user made that file in the first place [18:19:47] the current owner [18:19:51] in order to change the ownership [18:19:59] but ‘take’ might also work on the directory, I’m not sure [18:20:02] I'm the owner... [18:21:18] with take: default.py: You need to share a group with the file [18:22:15] ok. So, right now you’re not logged in as yourself but as the tool. So you need to be yourself, and change the owner and group that way [18:23:01] oh [18:23:17] so how can I do this? [18:24:07] well… you’re clearly logged in as the tool now. based on what ‘whoami’ says. [18:24:17] But you must’ve ssh’d in as yourself, right? [18:24:28] mmhhh... yes [18:24:41] PuTTY does it automatically [18:30:25] ah! [18:30:33] andrewbogott, now the problem is different [18:30:37] great! [18:30:39] I can access the config folder [18:30:42] That sounds like progress :) [18:30:47] but "chown: changing ownership of `default.py': Operation not permitted" [18:30:59] yes but the problem is still problematic XD [18:31:00] who are you logged in as? [18:31:16] rc-vikidia [18:31:34] how did you… get access to the directory? [18:31:39] cd [18:31:48] I managed the members [18:31:49] I thought you didn’t have access before? [18:31:55] Oh, I see. [18:31:59] in "service users" it was "rc-vikidia" [18:32:02] I removed it [18:32:04] so, does ‘take’ work now? [18:32:28] default.py: You need to share a group with the file [18:32:55] ok. Let’s back up a bit… do you understand why you can’t modify the file?
[18:33:10] I can modify [18:33:15] with WinSCP [18:33:23] well, I think... [18:33:25] one second [18:33:35] I mean, why you can’t modify from the shell. [18:33:36] ah, no [18:33:40] I can't... [18:33:44] no [18:33:52] OK. So, type ‘ls -ltr’ in the config dir. [18:33:53] because I'm not the owner? [18:34:08] What is your shell name in labs? [18:34:19] shell name? [18:34:35] username? Unapersona [18:34:39] unapersona* [18:34:42] Yes, you selected a shell name when you created your labs account. It is the name you use to log into instances. [18:34:49] ah, ok [18:34:53] unapersona [18:34:54] As opposed to the name you use to log into wikitech, which is potentially different. [18:34:57] ok. [18:35:03] with ls -ltr: total 4 [18:35:04] -rwxrw-r-- 1 unapersona wikidev 874 Jan 28 20:26 default.py [18:35:23] OK, great. So you can see that the file is owned by ‘unapersona’ and is in group ‘wikidev’. [18:35:27] Now, if you type ‘whoami’ [18:35:48] tools.rc-vikidia [18:35:58] you’ll see that /you/ are not ‘unapersona’ but rather ‘tools.rc-vikidia’ [18:35:58] ah, I have to become unapersona? [18:35:59] right. [18:36:10] But only unapersona has access to the file [18:36:10] ok [18:36:10] right. [18:36:29] "become: no such tool 'unapersona'" :/ [18:37:16] Ok — so when you first connected to tools-login you were ‘unapersona’ and then you did ‘become rc-vikidia’ right? [18:37:20] yes [18:37:28] So you just need to undo that ‘become’ which I think you can do with ‘exit’ [18:38:30] ok, done [18:38:44] now I'm unapersona@tools-dev:~$ [18:38:53] so now as unapersona can you chown that file? [18:39:58] but how can I access to rc-vikidia folder? [18:40:13] If i do "cd rc-vikidia" "-bash: cd: rc-vikidia: No such file or directory" [18:40:35] ‘pwd’ will show you what your current directory is. [18:40:42] ‘ls’ will show you the contents of that directory. [18:40:52] ok [18:40:59] ‘cd’ can change your current directory. [18:41:15] and cd.. to return? 
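The `ls -ltr` output above (`-rwxrw-r-- 1 unapersona wikidev ... default.py`) can be read mechanically: the mode string holds three permission triplets, and which one applies depends on who is asking. Here is a minimal Python sketch of that lookup. The function name is hypothetical, and the group memberships passed in are assumptions for illustration, since the log does not list them.

```python
def permissions_for(mode: str, owner: str, group: str,
                    user: str, user_groups: set[str]) -> str:
    """Return the rwx triplet of an `ls -l` mode string that applies to `user`."""
    triplets = {
        "owner": mode[1:4],   # characters 2-4: the file owner's rights
        "group": mode[4:7],   # characters 5-7: rights for the file's group
        "other": mode[7:10],  # characters 8-10: everyone else
    }
    if user == owner:
        return triplets["owner"]
    if group in user_groups:
        return triplets["group"]
    return triplets["other"]


mode = "-rwxrw-r--"
# Assumed memberships: the personal account is in wikidev, the tool account is not.
print(permissions_for(mode, "unapersona", "wikidev",
                      "unapersona", {"wikidev"}))
print(permissions_for(mode, "unapersona", "wikidev",
                      "tools.rc-vikidia", {"tools.rc-vikidia"}))
```

Under those assumptions the tool account falls into the "other" class and gets read-only access. Changing the file's owner is a separate matter and is normally reserved for root, whatever the mode bits say.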
[18:41:45] This is kind of like a venn diagram. Because ‘unapersona’ is a member of ‘rc-vikidia’, unapersona should be able to read any files that are owned by rc-vikidia. [18:41:49] The reverse is not true, though. [18:41:57] “cd ..” will take you up one dir level. [18:42:03] ok [18:42:12] “cd..” is a windowsism that doesn’t work on linux [18:42:19] aahh ok [18:42:41] "-bash: cd: config: Permission denied" [18:43:20] ok, it’s pretty hard for me to tell where you are or what’s happening :) [18:43:38] I'm in data/project/rc-vikidia [18:43:42] As unapersona you should be able to do “chown tools.rc-vikidia /full/path/to/file” [18:43:46] when I do "cd config" it tells me that [18:43:51] huh [18:43:53] mmhhh... [18:43:58] ok, let me log in and see what’s happening [18:44:04] ok [18:45:10] so, the directory ‘config’ is owned by tools.rc-vikidia. But the file inside is owned by unapersona. [18:45:14] Not sure how we got there :) [18:45:25] And I’m surprised you can’t read the dir. [18:45:47] :O [18:46:01] I can read the folder, but not the file [18:46:47] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%) [18:47:00] ok, I’ve modified default.py so it’s owned by tools.rc-vikidia [18:47:07] ok! [18:47:09] So you should be able to ‘become rc-vikidia’ and run your script. [18:47:22] let's try :) [18:47:28] But I’m not sure how to resolve this in the future. Probably best to scp files into the base dir (rc-vikidia) for simplicity. [18:47:43] ok [18:48:02] yes! :D [18:48:09] then once you have ownership sorted out you can move them as needed with ‘mv’ [18:48:20] (which is roughly like ‘rename’ on windows) [18:48:40] well, I do it with WinSCP [18:49:03] and... ehem... [18:49:22] to exit the python shell and return to tools shell? XD [18:49:46] ctrl-d [18:50:13] ctrl-c XD [18:50:28] it was jsub -once?
[18:50:49] NeoMahler: I don’t know what -once does, best to read the docs [18:50:59] ok [18:51:17] Is this fingerprint correct for bastion.wmflabs.org? https://wikitech.wikimedia.org/w/index.php?diff=108036 [18:52:02] When I run "ssh he7d3r@bastion.wmflabs.org" (which I don't know if I'm supposed to do, actually), a different fingerprint appears [18:52:09] Helder: I think this page is up to date: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints [18:52:11] "ECDSA key fingerprint is ea:b9:9f:e7:22:f9:94:18:8c:98:1d:69:c9:40:a1:a7." [18:52:13] If you find otherwise, please let me know [18:52:23] andrewbogott, ^ [18:52:47] huh, sure enough. [18:52:49] One second [18:53:03] :-) [18:53:05] Helder: I get debug1: Server host key: RSA 9d:48:7e:d8:89:49:0f:2d:39:6d:af:5e:23:02:aa:f7 [18:53:09] for bastion.wmflabs.org [18:53:30] Ah, valhallasw`cloud, /that/ is the fingerprint on the webpage as well. [18:53:33] So what is Helder seeing? [18:53:46] man-in-the-middle! [18:53:53] ;-) [18:54:09] Helder: bastion.wmflabs.org resolves to 208.80.155.129 for me [18:54:49] Helder: the fact it's ECDSA and not RSA suggests it's a trusty host [18:55:00] (tools-login has an RSA key, trusty.tools has ECDSA) [18:56:39] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK [18:57:39] valhallasw`cloud, $ host bastion.wmflabs.org [18:57:40] bastion.wmflabs.org has address 208.80.155.129 [18:58:08] Helder: anything in your .ssh/config? [18:59:22] Helder: alternatively, maybe pastebin the output of ssh -vv bastion.wmflabs.org ? [19:01:09] valhallasw`cloud, http://dpaste.com/1TS65JZ [19:04:21] Helder: if you try ‘bastion1’ do you get the same key or a different one? 
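The colon-separated fingerprints being compared here are MD5 digests of the server's public host key. Each key type the server offers (RSA, ECDSA, ...) is a different key and so has a different fingerprint. Here is a minimal Python sketch of how such a fingerprint is derived from an OpenSSH public key line. The key blob below is a made-up placeholder, not a real host key, and the function name is hypothetical.

```python
import base64
import hashlib


def md5_fingerprint(pubkey_line: str) -> str:
    """Colon-separated MD5 fingerprint of an OpenSSH public key line."""
    # A public key line looks like: "<type> <base64 blob> <comment>".
    blob = base64.b64decode(pubkey_line.split()[1])
    digest = hashlib.md5(blob).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Placeholder blobs, NOT real host keys: they only show the output format
# and that two different keys give two different fingerprints.
fake_rsa = base64.b64encode(b"placeholder-rsa-key").decode()
fake_ecdsa = base64.b64encode(b"placeholder-ecdsa-key").decode()
fp_rsa = md5_fingerprint(f"ssh-rsa {fake_rsa} root@example")
fp_ecdsa = md5_fingerprint(f"ecdsa-sha2-nistp256 {fake_ecdsa} root@example")
print(fp_rsa)
print(fp_ecdsa)
```

A mismatch against a published fingerprint is therefore only alarming when the key type matches the one that was published.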
[19:04:27] (It should be the same box) [19:05:10] valhallasw@bastion1:~$ ssh-keygen -l -f /etc/ssh/ssh_host_ecdsa_key.pub [19:05:10] 256 ea:b9:9f:e7:22:f9:94:18:8c:98:1d:69:c9:40:a1:a7 root@bastion1 (ECDSA) [19:05:25] ok, so what's happening is that for some reason you get the ECDSA fingerprint instead of the RSA one [19:05:39] valhallasw`cloud, it is the same "ea:b9:9f:e7:22:f9:94:18:8c:98:1d:69:c9:40:a1:a7" for bastion1 [19:05:41] (I didn't know servers could have two fingerprints for the same ssh server O_o) [19:06:00] valhallasw`cloud: me neither, seems unhelpful [19:06:21] So, nothing dangerous is happening, but something weird is happening with Helder’s ssh. [19:06:37] Helder: http://askubuntu.com/questions/133172/how-can-i-force-ssh-to-give-an-rsa-key-instead-of-ecdsa [19:07:56] OK, since no one is getting hacked, I’m going to go get lunch (aka ‘breakfast’). Back in ~an hour [19:08:26] my current config is this: http://dpaste.com/3PHBAP4 [19:08:29] Helder: if you don't trust me, use that method to force an RSA handshake, if you do trust me, just connect :-) [19:08:49] Helder: it's something with the system-wide config disabling the RSA handshake, I think [19:08:54] what ubuntu are you running? [19:09:05] 14.10 [19:09:12] (ubuntu studio) [19:14:03] Helder: hm, I'm running 14.04, so they might have changed the config in the meanwhile [19:16:09] Helder: can you post your /etc/ssh/ssh_config ? [19:17:14] oh, found the .deb [19:18:18] valhallasw`cloud, http://dpaste.com/0ZMBJF1 [19:18:42] yep, that's the same as in the .deb /and/ the same as in my ssh_config [19:24:42] valhallasw`cloud, "ssh -oHostKeyAlgorithms='ssh-rsa' bastion.wmflabs.org" gives me the RSA key fingerprint is 9d:48:7e:d8:89:49:0f:2d:39:6d:af:5e:23:02:aa:f7, as expected. [19:26:39] However, I don't think I have access to bastion.wmflabs.org after all ("Permission denied (publickey)." 
after adding it to the list of known hosts) [19:27:05] replag for dewiki is increasing on all three db servers at the same time. currently about 14 minutes [19:31:50] valhallasw`cloud, do you know how I can access https://wikitech.wikimedia.org/wiki/Nova_Resource:Revscoring once I'm added to the list of members? [19:32:12] Helder: through bastion [19:34:13] Helder: you're in the shell group, so I think you should have access [19:35:50] valhallasw`cloud, I think I know what I did wrong: I forgot to add "he7d3r@" in "ssh he7d3r@bastion.wmflabs.org" [19:36:08] Helder: ah. Yeah, that would cause those issues [19:36:14] After adding it, the "Permission denied" goes away [19:36:22] :-) [19:37:56] So, where should I look for the project now that I'm seeing "he7d3r@bastion1:~$"? [19:39:13] Helder: oh! the project actually doesn't have any running instances, so you can't ssh anywhere [19:39:27] someone first needs to actually spin up a vm [19:40:00] ah, ok :-) [19:40:59] I've never seen a new project before, so I was just looking around... [19:49:11] Tool-Labs: Investigate using monit to replace bigbrother - https://phabricator.wikimedia.org/T76840#1004934 (scfc) I looked a bit around and am not really convinced. I may be biased because the [[http://mmonit.com/monit/|monit website]] welcomes one with this Web 2.0 big print layout which (IMHO) looks reall... [19:54:52] Not able to log into bastion https://phab.wmfusercontent.org/file/data/4fhmt2e7emlw3mysztos/PHID-FILE-o6cl5zh6ylh5ct5b3fan/34waxlg4fznwta4y/bastion [19:55:16] Re-added my SSH keys to be sure, still doesn't work [19:58:08] When I change my username from prtksxna to Prtksxna I get "Permission denied (publickey)." instead [20:46:58] prtksxna: I can help you debug - can you try to log in once more while I watch the log? [21:01:45] andrewbogott: I did, just now [21:01:56] ok, looking... [21:02:02] andrewbogott: I have a couple of meetings right now :( [21:02:27] andrewbogott: Will you be here after 3hrs?
[21:02:34] prtksxna: it says ‘invalid user prateeksaxena' [21:02:42] so that should be pretty easy to fix — just figure out your shell name :) [21:02:51] prtksxna: I’ll probably be gone by then [21:02:55] andrewbogott: I have prtksxna set almost everywhere [21:03:01] Hi [21:03:10] andrewbogott: WORKS! [21:03:12] andrewbogott: <3 [21:03:15] * YuviPanda|flight waves from airport [21:03:18] that was easy [21:03:30] * prtksxna goes to meetings [21:03:31] YuviPanda|flight: you flying via Atlantic or Pacific? [21:03:45] Atlantic as always [21:03:51] Ah, so you’re in London/ [21:03:53] ? [21:04:03] andrewbogott: nope reached BLR [21:04:17] Standing in line for Ebola screening [21:04:25] Reading backlog [21:04:29] Oh! Welcome home, in that case [21:04:44] you staying in Bangalore for the next little while? Or do you have a bus ride ahead of you? [21:05:46] andrewbogott: Bangalore for about 3 days I think. But not sure where I am going to stay in BLR since I didn't call ahead [21:05:55] oops :) [21:06:05] Yeah [21:06:40] YuviPanda|flight: the main story with tools is that latency is super high for NFS. I want to restart labstore1001 but I’m reluctant since I’m not clear on if it will come back up without post-reboot intervention [21:07:10] I don’t know if a reboot will actually fix the latency, though, it may be that there’s a badly-behaved tool chewing up bandwidth as well. [21:09:48] andrewbogott: we have network stats on graphite for it [21:10:07] I checked before I left sf and it wasn't spiking [21:11:27] All I ever see on graphite is a (seemingly bottomless) nest of folders [21:12:28] what report did you look at? [21:15:27] YuviPanda|flight: do you have any intuition about restarting labstore1001? All I know is that the wiki docs say I have to run nfs-start and that there is no such script [21:18:26] andrewbogott: graphite.wikimedia.org. under servers [21:19:06] servers:labstore1001:network:bond0:??? 
[21:19:20] Yeah [21:19:28] Think so [21:21:42] My point is that even under bond0 there are a million options [21:22:16] OK, not a million :) 18 [21:27:03] andrewbogott: ah. tx_byte and Rx_byte I think [21:31:11] YuviPanda|flight: so, what does it mean if NFS latency on the client is super high, but network activity on labstore is normal? [21:31:24] (Not that I can really tell what’s ‘normal’ from those graphs) [21:31:59] andrewbogott: moment, setting up laptop [21:32:16] Are you through immigration now? Pretty quick. [21:35:58] andrewbogott: basically I just stayed back and let all the crowd go through before going through myself [22:28:52] YuviPanda|airpor / Coren: dewiki replag is about 150 minutes on all three db servers [22:29:58] or andrewbogott ^^ [22:32:02] replag was sometimes more than ten minutes during this day, but more than two hours is a bit too much for good results for most bots and web tools [22:32:26] Merlissimo: I'll take a look in about 10min? Just got off a really long flight and having some food [22:32:26] Sorry [22:32:30] This has been a bad week for tools [22:34:22] YuviPanda|airpor: 10 minutes is less than 8% of the replag ;-) [22:34:50] i pinged Sean first, but he seems to be offline. [22:44:51] Merlissimo: he is either on a flight or on the way to / from one I suppose [22:44:59] I've never really debugged replag issues but I'll try [22:47:16] it is still replicating, but really slow, that's why replag is increasing (about 4min/10min in the last half hour) [22:48:10] right [22:48:24] looking now [22:50:27] Merlissimo: I see it.
lag on replication from db masters to sanitarium [22:50:41] (db1069) [22:53:13] I see a spike in CPU usage since about 8h ago [22:54:04] ah, hmm [22:54:04] https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&c=MySQL+eqiad&h=db1069.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS [22:59:27] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:00:33] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:00:48] any maintenance script running? [23:00:58] on master [23:02:08] Merlissimo: they are all deadlocked on retrying deletion of particular pages [23:02:21] Merlissimo: but the other db slaves aren’t deadlocked [23:04:46] there is a missing rows problem since wednesday. perhaps this is related. either the deadlock causes missing rows or the missing rows are deadlocking deletion requests because they don't match a row [23:05:12] Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1005375 (yuvipanda) [23:05:13] Tool-Labs: Show replication lags in Ganglia - https://phabricator.wikimedia.org/T50694#1005373 (yuvipanda) Resolved>Open Ganglia has been dead for a while now. [23:05:38] Merlissimo: yeah, that’s possible [23:05:47] Merlissimo: do you have links to previous times this has been a problem and sean had fixed it? [23:07:12] i pinged sean at 19:31 with a replag of 900 seconds but this was gone at 19:58 [23:07:26] utc [23:08:56] and at 20:44 replag was 700 seconds according to my shell log [23:12:20] YuviPanda|airpor, did you see my bug about deployment-prep's job runner? [23:12:39] Krenair: no... [23:12:39] is it dead? [23:12:48] very dead [23:13:00] link / bug?
[23:13:03] you start the service as root and it says it started [23:13:07] then check the status, it's stopped [23:13:15] https://phabricator.wikimedia.org/T88094 [23:13:29] timing seems to correlate with your operations/puppet change [23:13:51] krenair@deployment-jobrunner01:~$ mwscript showJobs.php --wiki=enwiki [23:13:52] 17491 [23:16:22] Krenair: hmm, so I’m at the BLR airport and it’s 5AM, so I’ll probably not be of much use on that for a while. Am trying to debug mysql replag issues as well [23:16:34] okay [23:16:47] Krenair: sorry! I’ll take a look whenever I wake up? [23:17:00] Merlissimo: is there already a bug for this or should I file one? [23:17:03] sure [23:17:52] YuviPanda|airpor: new one [23:18:07] Merlissimo: let me do that. [23:24:31] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [23:30:32] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [23:36:31] Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005558 (yuvipanda) NEW [23:36:45] Merlissimo: ^ am adding info as I find it, but I don’t know if I can fix it without sean :( [23:39:44] Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005571 (Merl) [23:39:47] Wikimedia-Labs-Other, operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1005570 (Merl) [23:41:16] Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005582 (yuvipanda) Hmm, there's no 1:1 correspondence between the slices that are lagging and the ones that are reporting deadlock errors (s2 in particular). There was a huge spike in network traffic / CPU usage on db1069 sinc...
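The "about 4min/10min" figure quoted earlier for the growing replag can be turned into a rough replication speed: if lag grows by 4 minutes over 10 minutes of wall-clock time, the replica applied only about 6 minutes' worth of changes in that window. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and the model assumes lag grows linearly over the interval.

```python
def apply_rate(lag_growth_min: float, interval_min: float) -> float:
    """Fraction of real time the replica keeps up with, given lag growth.

    A result of 1.0 means the replica keeps pace, and 0.0 means it is stalled.
    """
    return 1.0 - lag_growth_min / interval_min


rate = apply_rate(4, 10)
print(f"replica applies about {rate:.0%} of incoming changes per unit time")
```

A rate above zero matches the observation in the log that replication was still running, only slowly.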
[23:53:31] Tool-Labs, operations: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1005606 (yuvipanda) replag is still growing, but my battery is running out, the airport people are looking at me suspiciously, and my brain isn't at its best after a 26h flight. Hopefully @springle can take a look soon, if not I... [23:53:46] Merlissimo: I’ve to go now, I think. sorry! [23:54:02] my battery is dying anyway [23:54:05] YuviPanda|airpor: thanks a lot [23:54:30] Merlissimo: I’ve kept some notes on the phab ticket, I’ll take a look again in a while when I’m back home. [23:54:50] Merlissimo: am thinking we should set up a ‘how to make toollabs’ better consultation or something like that and commit to working on all of those. [23:54:59] as it is now it definitely needs a lot more love [23:55:04] * YuviPanda|airpor is off for now [23:55:07] Merlissimo: thanks for reporting it!