[04:10:52] 6Labs, 10wikitech.wikimedia.org: Exclude nova resource pages from *default* wikitech search - https://phabricator.wikimedia.org/T122993#1921248 (10bd808) >>! In T122993#1918848, @valhallasw wrote: > I just realized this also excludes SALs (and project index pages!) from search, and SALs tend to be provide usef...
[04:43:17] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[05:48:17] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[05:54:15] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[07:14:16] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[07:20:17] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[08:04:34] 10Labs-project-extdist, 5Patch-For-Review: Puppet is not running on extdist instances - https://phabricator.wikimedia.org/T123090#1921398 (10Legoktm) 5Open>3Resolved a:3Legoktm
[08:05:16] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[08:11:16] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[08:16:14] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[09:54:43] 6Labs, 10Tool-Labs, 10DBA: Faebot is crashing labsdb1002 - https://phabricator.wikimedia.org/T119604#1921455 (10Fae) As an example of the impact of the throttling, the "thanks report" https://meta.wikimedia.org/wiki/User:Faebot/thanks has not been updated since its last run in August 2015. There was going to...
[10:28:44] 6Labs, 10Tool-Labs, 10DBA: Unthrottle faebot - https://phabricator.wikimedia.org/T123108#1921489 (10valhallasw) 3NEW
[10:29:04] 6Labs, 10Tool-Labs, 10DBA: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#1921503 (10valhallasw)
[10:29:06] 6Labs, 10Tool-Labs, 10DBA: Faebot is crashing labsdb1002 - https://phabricator.wikimedia.org/T119604#1921500 (10valhallasw) 5Open>3Resolved
[10:31:27] 6Labs, 10Tool-Labs, 10DBA: Faebot is crashing labsdb1002 - https://phabricator.wikimedia.org/T119604#1921505 (10valhallasw) @Fae: please create new tasks rather than re-using existing ones -- changing the task description is likely not to be noticed, while a new task is. I've done that in {T123108}. Thanks!
[11:22:15] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[11:47:18] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[11:53:17] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[12:08:14] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[14:48:18] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[16:13:16] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[16:19:17] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[16:22:32] YuviPanda: ^ I'm connected without issues, no crazy load or something like that. Internal labs network issues?
:/
[16:39:15] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[16:44:48] tools-bastion-02 is running out of memory
[16:44:57] python and nscd seem to hog quite a bit
[16:45:18] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[16:45:54] YuviPanda, andrewbogott
[16:56:00] looks like not only bastion-02: http://tools.wmflabs.org/?status says:
[16:56:09] tools-exec-1202 Load: 219.75% Memory: 56.4% Free vmem: 0M
[16:57:45] abartov: it has over a gigabyte free?
[17:00:50] icelab is using quite some cpu, yes
[17:01:14] more than 100% of the CPU? Is that possbile?
[17:01:18] *possible
[17:03:02] usually means multiple cores
[17:03:15] ah, ok
[17:03:37] I think that cpu number is 'load/ncpu'
[17:06:06] linkwatcher seems to have gone rogue (the jobs are not running under SGE, it seems)
[17:07:31] so… is there an actual problem on bastion-02? I don’t know what’s normal
[17:08:16] valhallasw`cloud: no, it has ~130M free.
[17:09:02] andrewbogott: yes, there's a problem. A relatively mundane task of mine crashed on failing to allocate memory.
[17:09:49] looks at processes 24967, 11898, 9806, etc.
[17:15:26] s/looks/look, andrewbogott
[17:15:55] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:19:30] icebot is again using a whole bunch of resources
[17:19:44] and every now and then it apparently peaks to ~130M free
[17:22:40] jyothis: dewiki has a high replag
[17:23:36] !log tools killing tools.icelab as per https://wikitech.wikimedia.org/wiki/User_talk:Torin#Running_queries_on_tools-dev_.28tools-bastion-02.29
[17:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[17:24:58] abartov: is that better? I’m thinking I should probably head for the office.
[17:25:18] andrewbogott: sure, not urgent.
[17:25:24] andrewbogott: but only very slightly better.
[17:25:49] what is hazard, and why is it taking up 5G?
[17:25:58] abartov: no idea.
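The "Load: 219.75%" figure from tools.wmflabs.org/?status is read in the channel as load average divided by CPU count ("I think that cpu number is 'load/ncpu'"). A minimal sketch of that interpretation; the status page's exact formula is an assumption, and `load_percentage` is a hypothetical helper, not part of the tool:

```python
import os

def load_percentage(loadavg_1min: float, ncpu: int) -> float:
    """Express a 1-minute load average as a percentage of available cores.

    A value above 100% means more runnable tasks than CPUs; on these
    hosts that often indicated processes stuck in I/O wait (e.g. NFS).
    """
    return loadavg_1min / ncpu * 100


def current_load_percentage() -> float:
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    one_min, _, _ = os.getloadavg()
    return load_percentage(one_min, os.cpu_count() or 1)
```

With this reading, tools-exec-1202's 219.75% on a 4-core instance corresponds to a 1-minute load average of about 8.79.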
[17:26:17] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1921762 (10valhallasw) 3NEW
[17:27:07] abartov: 500M, not 5G
[17:27:14] 0.3% mem, if I can believe top
[17:27:20] no, that's even less
[17:27:37] that's just 24M
[17:27:48] abartov: where did you get the 5G number?
[17:28:37] * andrewbogott out for now
[17:28:39] VIRT in top(1)
[17:29:55] virtual memory is memory too. When too much of it is hogged, other things cannot be swapped out to fulfil an allocation request.
[17:33:28] mm.
[17:35:09] ...triangle.py is empty. What on earth
[17:42:06] 6Labs, 10Tool-Labs: overcommit disabled + small swap on tools-bastion-02 - https://phabricator.wikimedia.org/T123123#1921779 (10valhallasw) 3NEW
[17:42:19] thanks, valhallasw`cloud
[17:43:20] 6Labs, 10Tool-Labs: python3 triangle.py causing high memory usage on tools-bastion-02 - https://phabricator.wikimedia.org/T123122#1921787 (10valhallasw) In any case, due to {T123123}, other users could not use the bastion because of this.
[17:52:20] 6Labs, 10Tool-Labs: python3 triangle.py causing high memory usage on tools-bastion-02 - https://phabricator.wikimedia.org/T123122#1921796 (10valhallasw) The script is now non-empty, but also definitely non-wikimedia. @Hazard-SJ, please remember this is not allowed per the [[ https://wikitech.wikimedia.org/wiki...
[18:38:48] chasemp, YuviPanda, either of you around? Do we have any explanation for those pages other than ‘it worked for me’?
[18:40:07] no although there was talk of an exec node hammering nfs in -ops
[18:40:14] not sure how well that correlates
[18:41:11] 1202?
[18:41:30] 1206 was the one I saw
[18:42:55] OK (1202 had > ncpu load, which suggested high io)
[18:46:26] So that would be 'PROBLEM - High load average on labstore1001 is CRITICAL'
[18:46:28] the timing fits
[18:48:14] when was this? I don't see any discussion on it in #wikimedia-operations, and exec-1206 doesn't seem to have any large i/o going on.
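The 5G/500M/24M confusion above comes from reading different top(1) columns (VIRT vs %MEM vs RES), which scale their units with k/m/g suffixes while bare numbers default to KiB. A hedged sketch of a converter for such fields; the default-KiB assumption matches procps-ng top but is worth verifying against the local build:

```python
# Multipliers for top(1)-style memory fields ('5.0g', '500m', '24000').
# Unsuffixed values are assumed to be KiB, top's default display unit.
_SUFFIXES = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def top_mem_to_bytes(field: str) -> int:
    """Convert a VIRT/RES/SHR field from top(1) output to bytes."""
    field = field.strip().lower()
    if field and field[-1] in _SUFFIXES:
        return int(float(field[:-1]) * _SUFFIXES[field[-1]])
    return int(float(field)) * 1024  # bare number: KiB
```

As the channel notes, a large VIRT (address space reserved) is not the same as resident memory, but with overcommit disabled and a small swap (T123123) reserved-but-unused address space still counts against allocations.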
But we have the accounting file, so it should be possible to pinpoint which job this was. Or maybe I'm just late to the party, and this has all already been done?
[18:53:11] it was newbot.py but my logs lack time atm
[18:53:28] right, sorry it was in _security and the gist was that it was gone by the time anyone was looking
[19:02:05] sql enwiki_p; just stopped working for me :-(
[19:12:26] 6Labs, 10DBA: Labs: High Replication Lag - https://phabricator.wikimedia.org/T123130#1921893 (10Krenair)
[19:19:52] chasemp: thanks!
[19:26:56] I can't find anything 'newbot' in the accounting logs, unfortunately :(
[19:35:15] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[19:41:15] PROBLEM - SSH on tools-mail-01 is CRITICAL: Server answer
[19:41:24] 6Labs, 10Tool-Labs: tools-mail-01 SSH check flapping - https://phabricator.wikimedia.org/T123133#1921938 (10valhallasw) 3NEW
[19:42:36] 6Labs, 10Tool-Labs: tools-mail-01 SSH check flapping - https://phabricator.wikimedia.org/T123133#1921946 (10valhallasw) This time something //is// wrong: ``` valhallasw@maeglin:~$ ssh tools-mail-01. ssh_exchange_identification: Connection closed by remote host ```
[19:46:04] !log couldn't get into tools-mail-01 at all and it seemed borked so I rebooted
[19:46:04] couldn't is not a valid project.
[19:46:10] ah
[19:46:13] !log tools couldn't get into tools-mail-01 at all and it seemed borked so I rebooted
[19:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[19:48:57] uuuh. /var/log/auth.log seems to be missing a significant piece
[19:49:43] either I was not awake this afternoon and was connecting to a different host, or the log somehow got borked in the reboot, or there's something seriously wrong
[19:50:37] Anyone else experiencing timeout lag right now with login?
[19:50:56] ResMar: login where?
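The "pinpoint which job this was" search of the gridengine accounting file can be scripted: the file is colon-delimited, one record per finished job, with queue, host, group, owner, job name and job number as the leading fields (field order assumed from the sge_accounting(5) layout; `find_jobs` is a hypothetical helper):

```python
def find_jobs(accounting_lines, name_substring):
    """Yield (owner, job_name, job_number) for matching SGE accounting records."""
    for line in accounting_lines:
        if line.startswith("#"):      # comment/header lines
            continue
        fields = line.rstrip("\n").split(":")
        if len(fields) < 6:           # malformed record
            continue
        qname, host, group, owner, job_name, job_number = fields[:6]
        if name_substring in job_name:
            yield owner, job_name, job_number
```

Of course, as noted at 19:26:56, this only helps if the job ran under SGE at all; processes started directly on an exec node never reach the accounting file.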
[19:51:11] tools-login seems OK for me, although there are quite some users running tools there it seems
[19:51:15] RECOVERY - SSH on tools-mail-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[19:51:30] valhallasw`cloud: ssh -i etc. is timing out for me at the moment.
[19:51:38] 6Labs, 10Tool-Labs: tools-mail-01 SSH check flapping - https://phabricator.wikimedia.org/T123133#1921950 (10valhallasw) @chasemp rebooted the box and tools-mail-01 is back online. Oddly enough, /var/log/auth.log seems to be missing a significant chunk of data: ``` Jan 7 15:02:47 tools-mail-01 sudo: pam_unix(...
[19:51:48] ResMar: which host?
[19:51:48] Huh. Worked on the tenth attempt.
[19:52:17] @ login.tools.wmflabs.org
[19:53:12] PROBLEM - Puppet staleness on tools-mail-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[19:53:51] oh, I see what was going on. I was connected to tools-mail, not tools-mail-01.
[19:54:38] 6Labs, 10Tool-Labs: tools-mail-01 SSH check flapping - https://phabricator.wikimedia.org/T123133#1921952 (10valhallasw) 5Open>3Resolved a:3valhallasw It's the first. I was connected to tools-mail, which was (obviously) not the one suffering from connection issues.
[19:54:49] I pushed an update to a tool I have up on labs which added a bit of text to the description and now it raises HTTP 400 when submitting POST
[19:56:54] ResMar: roll back the change?
[19:58:05] valhallasw`cloud: It's a Flask app that works on a local instance, but stopped working when posted to the server; in general how does one go about debugging that?
[19:58:11] although, combined with your earlier connection issues, it suggests a network issue
[19:58:20] It has to be something with the webservice, right, not with the code.
[19:58:42] ResMar: if you ping tools-login from home, do you see any packet loss?
[20:00:00] I'm getting 61 ping and 0.0% packet loss
[20:00:15] So no :-)
[20:00:43] hm.
[20:01:10] I've had issues before with specific commands not running at high throughput
[20:01:17] so I'm not sure which part sends the 400 -- whether it's flask or whatever is running in front (uwsgi? or wsgi-cgi?)
[20:01:36] One pip I had to run for over a minute to execute
[20:02:47] I'm missing the relationship between that and your webservice issue
[20:03:01] Probably none
[20:03:11] That's some strange issue on my side, the webservice issue is server-side
[20:03:18] Um, uwsgi-python and webservice2 I believe.
[20:03:21] RECOVERY - Puppet staleness on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[20:04:46] right. so add debug logging to your flask app, and try to figure out on which level this happens
[20:05:05] Do app.run(debug=True) you mean?
[20:05:29] I think that should log to error.log, yes
[20:07:04] ResMar: eh, you're getting a HTTP/500, not a /400? in uwsgi.log
[20:07:11] [pid: 30201|app: 1|req: 2/4] 10.68.21.49 () {38 vars in 729 bytes} [Fri Jan 8 19:54:33 2016] POST /signpostlab/blog-importer.html => generated 291 bytes in 393 msecs (HTTP/1.1 500) 2 headers in 84 bytes (1 switches on core 0)
[20:07:43] so that's something in your code erroring out
[20:08:28] 500 on the server side, 400 on the front?
[20:09:37] ?
[20:09:56] HTTP/400 = Bad Request, HTTP/500 = Internal Server Error
[20:10:04] My web browser is displaying a 400 error, while error.log, per your paste, is logging a 500 error.
[20:10:41] that's... really weird
[20:10:55] Erm. After appending debug=True and restarting the service is is now...working again.
[20:11:57] so... it's actually responding with an HTTP/500, but the text says 400 - Bad Request. What on earth...
[20:14:12] Aha. It has to do with the proxy trying to load the error page, but because it's a POST with content-length, lighttpd blows up with an HTTP/400. That's creative.
[20:14:22] ResMar: please don't forget to turn off debug again :-)
[20:14:53] Let's see what happens...
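The advice at 20:04:46 ("add debug logging ... figure out on which level this happens") can be followed without leaving Flask's `debug=True` enabled in production, which the channel rightly warns against at 20:14:22. A framework-agnostic sketch using only the stdlib: a WSGI middleware that logs the traceback and returns a genuine HTTP/500. The logger name is a placeholder, not the tool's actual setup:

```python
import logging
import traceback

log = logging.getLogger("webservice")

def catch_errors(app):
    """WSGI middleware: log the traceback and return a plain HTTP/500.

    This keeps the diagnostic in the error log (error.log / uwsgi.log
    here) without enabling Flask's interactive debugger in production.
    """
    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception:
            log.error("unhandled exception:\n%s", traceback.format_exc())
            start_response("500 Internal Server Error",
                           [("Content-Type", "text/plain")])
            return [b"internal server error"]
    return wrapped
```

With Flask this would be applied as `app.wsgi_app = catch_errors(app.wsgi_app)`, so the 500 is visible in the tool's own logs regardless of what the front proxy turns it into.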
[20:15:14] By the way
[20:15:21] Props to whoever did the login ASCII graphic
[20:15:22] It's very nice.
[20:16:15] I think that was coren :-)
[20:17:36] valhallasw`cloud: So, to summarize, my tool was failing with an HTTP 400 that was actually an HTTP 500.
[20:17:46] ResMar: yep.
[20:17:52] valhallasw`cloud: I turned on debug, and it started working again. I turned debug back off, and it kept working.
[20:18:06] I have no clue why that solved your 500.
[20:18:32] A wizard did it.
[20:18:48] ¯\_(ツ)_/¯
[20:19:48] 6Labs, 10Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1921992 (10valhallasw) 3NEW
[20:20:43] 6Labs, 10Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1922000 (10valhallasw)
[20:40:54] 6Labs, 10DBA: Labs: High Replication Lag - https://phabricator.wikimedia.org/T123130#1922071 (10Aklapper) @Boshomi: How to see that problem / check that it still exists? (Also see https://mediawiki.org/wiki/How_to_report_a_bug )
[20:46:34] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1922072 (10valhallasw)
[21:02:58] Yeah, the Ascii Art is mine. Old school to the last. :-)
[21:06:24] Coren: I certainly appreciate it :-)
[21:35:09] 6Labs: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#1922090 (10Tpt) The table is already in the maintain-replicas.pl script but it doesn't seem to be replicated on labs. See https://git.wikimedia.org/blob/operations%2Fsoftware.git/master/maintain-replicas%2Fmai...
[21:37:19] 6Labs, 10Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1922094 (10scfc) The error is caused by the proxy: ``` scfc@tools-bastion-01:~$ curl -iH 'Host: tools.wmflabs.org' -d boo=fdsfdsafdsafda https://tools.wmflabs.org/gerrit-reviewer-bot/trigger400.php HTTP/1....
[21:39:59] 6Labs, 10Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1922101 (10valhallasw) Doh. I had figured that out, but then written the task description in a way that completely failed to mention it. Sorry for that :( Yes, the issue is indeed that the request is re-se...
[22:08:15] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1922143 (10Boshomi) @Aklapper Yes the problem still exists. supercount shows a delay of 21954 seconds, but supercount is not the case, there are other tools for editing like https://tools.wmflabs.or...
[22:15:26] 6Labs, 10Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1922164 (10valhallasw) With a bit of help from tcpdump: ``` GET /admin/?500 HTTP/1.1 Connection: close Host: tools.wmflabs.org X-Forwarded-Proto: https X-Original-URI: /gerrit-reviewer-bot/trigger400.php C...
[22:41:57] 6Labs, 10Tool-Labs, 10DBA: Unthrottle faebot - https://phabricator.wikimedia.org/T123108#1922215 (10jcrespo) @valhallasw I told @Fae to reopen that ticket, so if anyone's, it is my fault for reopening that. I apologize (I was actually monitoring that). [22:42:05] 6Labs, 10Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1922216 (10valhallasw) Fwiw, uwsgi is less squeamish than lighttpd about incorrect requests. Nginx itself also complains about the request when it receives it. It might also be possible to work around this...
[22:43:16] 6Labs, 10Tool-Labs, 10DBA: Unthrottle faebot - https://phabricator.wikimedia.org/T123108#1922217 (10jcrespo) BTW, this will have to wait, because there is another process crashing labsdb hosts, and I have first to investigate that.
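The tcpdump capture in T123136 shows the root cause: the proxy re-issues the failed POST as `GET /admin/?500` while keeping body-describing headers from the original request, and lighttpd rejects a GET carrying a Content-Length with an HTTP/400. One workaround direction is to strip those headers when building the error-page subrequest. A sketch; `error_page_headers` is a hypothetical helper illustrating the idea, not the proxy's actual code:

```python
# Headers that describe a request body. A GET subrequest for an error
# page must not carry these, or strict servers (lighttpd in this case)
# answer 400 Bad Request instead of serving the error page.
_BODY_HEADERS = {"content-length", "content-type", "transfer-encoding", "expect"}

def error_page_headers(original_headers):
    """Build headers for the GET /?<code> subrequest from the failed request."""
    return {name: value
            for name, value in original_headers.items()
            if name.lower() not in _BODY_HEADERS}
```

This matches valhallasw's observation at 22:42:05 that uwsgi happens to tolerate the malformed subrequest while lighttpd and nginx do not; sanitizing the subrequest fixes it for all backends.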
[23:07:25] 6Labs, 10Tool-Labs, 10DBA: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1922236 (10jcrespo) p:5High>3Normal I agree that 30 seconds is too low for this timeout, I have set them to 300 seconds: ``` MariaDB TOOLSDB master localhost (n...
[23:07:51] 6Labs, 10Tool-Labs, 10DBA: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1922238 (10jcrespo) a:3jcrespo
[23:14:49] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1922244 (10jcrespo) This is a known issue, 2 databases have crashed in the last 2 days. I have yet to investigate the core cause but I am taking care of fixing the replication issues first. Please...
[23:15:06] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1922245 (10jcrespo) a:3jcrespo
[23:15:29] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1921883 (10jcrespo) p:5Triage>3High
[23:26:09] 6Labs, 10DBA: Labs: High Replication Lag (s6 on c1, s1 and s5 on c2) - https://phabricator.wikimedia.org/T123130#1922252 (10jcrespo) To give an advance of the work done: Things are now getting corrected, and all replication channels are active again, but it will take some hours to reduce the replication lag b...
[23:41:51] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1922274 (10Legoktm) At least Echo content should not be replicated as those tables are private info.
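T122658 above concerns idle tunnelled connections being dropped after 30 seconds, raised server-side to 300. A client can also raise its own session's idle timeout after connecting; a sketch of building that statement (assuming the variable in question is MariaDB's `wait_timeout`, which the truncated ticket comment does not confirm; `session_timeout_sql` is a hypothetical helper):

```python
def session_timeout_sql(seconds: int) -> str:
    """Return a statement raising this session's idle timeout.

    Useful from a tool that keeps a tunnelled connection open longer
    than the server-side default allows. SET SESSION only affects the
    current connection, unlike the SET GLOBAL change made by the DBA.
    """
    if not 0 < seconds <= 28800:   # cap at MySQL's historical default of 8h
        raise ValueError("unreasonable wait_timeout")
    return f"SET SESSION wait_timeout = {seconds}"
```

A driver would then run `cursor.execute(session_timeout_sql(300))` right after connecting; long-idle tools may still prefer periodic pings, since intermediate SSH tunnels can drop idle TCP connections on their own.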