[00:35:02] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 07Tracking: Issues with 'webservice' kubernetes backend - https://phabricator.wikimedia.org/T139107#2489156 (10yuvipanda)
[00:35:04] 10Labs-Kubernetes: Kubernetes does not mount shared path - https://phabricator.wikimedia.org/T141098#2489154 (10yuvipanda) 05Open>03Resolved Cool! I've documented that it is available in /data/project/shared in https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web/Kubernetes
[01:56:19] !log tools deploy kubernetes v1.3.3wmf1
[01:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[02:04:35] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[02:21:38] 06Labs, 10Labs-Infrastructure, 10DBA: labsdb* has no High Availability solution - https://phabricator.wikimedia.org/T141097#2489261 (10Danny_B)
[02:21:48] 06Labs, 10Labs-Infrastructure, 10DBA: Having lots of accounts with separate grants makes auditing difficult. - https://phabricator.wikimedia.org/T141096#2489262 (10Danny_B)
[02:22:01] 06Labs, 10Labs-Infrastructure, 10DBA: Users can't run EXPLAIN queries to check the theoretical efficiency of their SQL - https://phabricator.wikimedia.org/T141095#2489263 (10Danny_B)
[02:39:37] RECOVERY - Puppet run on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:51:50] 06Labs, 10Labs-Infrastructure, 06Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2489274 (10chasemp) I had to reboot seaborgium today as it froze up and took out ldap with it. > !log gnt-instance reboot seaborgium.wikimedia.org I would say...definitely something is sti...
[03:54:26] PROBLEM - Puppet run on tools-worker-1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[03:54:31] PROBLEM - Puppet run on tools-redis-1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[03:54:49] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[03:55:01] PROBLEM - Puppet run on tools-worker-1008 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:55:04] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:55:45] uh
[03:56:28] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [0.0]
[03:56:30] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:40] PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[03:56:41] PROBLEM - Puppet run on tools-web-static-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:46] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[03:56:46] PROBLEM - Puppet run on tools-merlbot-proxy is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:48] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:56:48] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [0.0]
[03:56:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:51] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:55] PROBLEM - Puppet run on tools-exec-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:56:59] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[03:57:03] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:57:08] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:57:08] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:57:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:57:15] PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:57:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:57:17] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[03:57:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:57:31] chasemp ^ just side effects of the ldap outage, verified it is fine now.
[03:57:36] puppet is going to take a dive w/ ldap having been down and should recover here
[03:57:45] yeah ok cool I was doing the same
[03:57:54] this is getting to be a serious thing
[03:58:37] chasemp yeah...
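For readers following along: when LDAP goes away, Puppet agent runs on the Tools instances start failing, and the shinken alerts above clear on their own once runs succeed again. A minimal sketch of how one might spot-check a single instance instead of waiting for the next alert cycle, assuming sudo access and a stock Puppet 3 agent layout (generic commands, not necessarily the exact procedure the admins used here):

```
# Did the last Puppet run report failures, and when did it happen?
sudo grep -A4 'events:' /var/lib/puppet/state/last_run_summary.yaml

# Kick off a run right away instead of waiting for the next scheduled one,
# so the corresponding shinken check can recover sooner.
sudo puppet agent --test
```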
[03:59:09] PROBLEM - Puppet run on tools-exec-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:11] PROBLEM - Puppet run on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:13] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:17] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:17] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[03:59:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:25] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:25] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[03:59:25] PROBLEM - Puppet run on tools-exec-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:26] PROBLEM - Puppet run on tools-exec-1210 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:29] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:30] PROBLEM - Puppet run on tools-exec-1211 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0]
[03:59:30] PROBLEM - Puppet run on tools-exec-1401 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:30] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[03:59:37] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:38] PROBLEM - Puppet run on tools-web-static-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:38] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:59:40] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[03:59:40] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0]
[03:59:42] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:42] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:52] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[03:59:52] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[03:59:52] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[03:59:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:00:00] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:00:02] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[04:00:02] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0]
[04:00:02] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[04:00:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:00:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:01:58] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:02:16] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:04:32] RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:04:32] RECOVERY - Puppet run on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:04:52] RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:04:52] RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:05:04] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:05:04] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:05:58] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[04:06:29] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:06:37] RECOVERY - Puppet run on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:06:47] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:06:48] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:06:55] RECOVERY - Puppet run on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:07:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:07:14] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:07:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:06] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Omidfi was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=781262 edit summary:
[04:09:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:09:22] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:23] RECOVERY - Puppet run on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:26] RECOVERY - Puppet run on tools-worker-1005 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:27] RECOVERY - Puppet run on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:41] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:41] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:41] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:42] RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:10:04] RECOVERY - Puppet run on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:10:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:11:46] RECOVERY - Puppet run on tools-merlbot-proxy is OK: OK: Less than 1.00% above the threshold [0.0]
[04:12:12] RECOVERY - Puppet run on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:14:09] RECOVERY - Puppet run on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:14:19] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[04:14:31] RECOVERY - Puppet run on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:17:03] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:19:25] RECOVERY - Puppet run on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:19:26] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:19:52] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:20:04] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:21:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:21:51] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[04:22:09] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:13] RECOVERY - Puppet run on tools-webgrid-generic-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:14] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:15] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:18] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:22] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:36] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:50] RECOVERY - Puppet run on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:52] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:25:00] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:25:04] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:25:04] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:25:58] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:26:42] RECOVERY - Puppet run on tools-web-static-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:27:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:29:19] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:29:39] RECOVERY - Puppet run on tools-web-static-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:29:41] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:30:01] RECOVERY - Puppet run on tools-worker-1008 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:30:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:31:31] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:31:39] RECOVERY - Puppet run on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:31:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:31:47] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:34:17] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:34:49] RECOVERY - Puppet run on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:20:26] !log mediawiki-core-team deleted urlshortener instance
[05:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mediawiki-core-team/SAL, Master
[05:20:51] bd808: ^ fyi
[06:25:24] legoktm did you get php7 set up?
[06:26:28] YuviPanda: heh no. I was going to work on it a few days ago except the labs instance was busted and needed a reboot and I spent all of my time realizing that and was not motivated to actually work on it -.-
[06:27:06] sounds not unfamiliar unfortunately :(
[09:03:33] 06Labs, 10Labs-Infrastructure, 10DBA: labsdb* has no High Availability solution - https://phabricator.wikimedia.org/T141097#2489365 (10jcrespo) > bd808 changed the title from "labsdb* has no HA solution" to "labsdb* has no High Availability solution". I do not like much this title. I think we all understand...
[09:10:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[09:17:36] PROBLEM - SSH on tools-k8s-etcd-02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:29:00] Hi! Are there currently any known problems with creating new tool projects? I tried to create one half an hour ago and I still cannot access it. However, trying to register one with the same name also fails. Any ideas?
[09:33:37] Yellowcard: I think LDAP went out earlier today
[09:35:15] tom29739: ah, that explains it. Is it still out and/or is there a phab bug?
[09:50:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:11:21] 06Labs, 10Tool-Labs: Created tool does not show up, re-creation impossible - https://phabricator.wikimedia.org/T141178#2489411 (10Yellowcard)
[10:39:00] Hi! How do I make a cron job quiet enough that there are no cron status lines like this in the .out file:
[10:39:03] [Sat Jul 23 10:20:15 2016] there is a job named 'portal-db' already active
[10:42:15] Yellowcard: I'm not sure
[10:43:29] tom29739: alright, I filed a bug in Phabricator. Let's see what the sysops find out; the tool still seems to be stuck somewhere
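As an aside on what "stuck somewhere" can look like: a rough diagnostic sketch for a half-created tool, assuming a shell on a Tools bastion, that the default LDAP client configuration there allows this read, and using Yellowcard's tool name from the log purely as the example:

```
# 1. Was the LDAP service group created?
ldapsearch -x -b 'ou=servicegroups,dc=wikimedia,dc=org' 'cn=tools.eulenwiki' dn member

# 2. Did the tool's home directory appear on the shared filesystem?
ls -ld /data/project/eulenwiki

# 3. `become` only works once both pieces exist.
become eulenwiki
```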
[10:44:22] There were some messages about it at about quarter to five my time this morning: "just side effects of the ldap outage, verified it is fine now", "puppet is going to take a dive w/ ldap having been down and should recover here"
[10:45:10] Yellowcard: what's the error message you get when you try and become it?
[10:45:14] (if any)
[10:45:56] trying to become it returns "project doesn't exist" (or something like that), but trying to create one with this name simply returns "creation failed"
[10:46:58] Weird
[10:47:00] however, the first time I created it, it returned "creation successful" (or similar)
[10:47:13] When did you try and create it?
[10:47:37] phew, let's say two hours ago, maybe a little less
[10:48:12] actually, my second weird tool labs problem within 24 hours :D
[10:48:41] trying to become it returns: "become: no such tool 'eulenwiki'"
[10:49:01] Gimme a sec
[10:49:08] I'm wondering whether I should create a second one with a different name, but I don't want to create an even bigger mess
[10:49:13] Wonder whether the LDAP user was created
[10:50:54] Yellowcard: it doesn't appear in the great big list
[11:07:50] tom29739: mhm, would you suggest trying to create a new one, then, or rather waiting?
[11:27:12] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[11:37:58] Yellowcard: it's as if the user was never created
[11:38:19] Yellowcard: try creating the tool again
[11:38:27] tom29739: just that I cannot re-create the tool
[11:38:30] OK, I will!
[11:38:53] It's like the tool was only partially created
[11:40:54] tom29739: "Failed to create service group."
[11:41:50] I wonder...
[11:42:26] RECOVERY - SSH on tools-k8s-etcd-02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:48:54] Yellowcard, that's weird
[11:49:08] It shows up in the big list of service groups
[11:49:16] https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup
[11:49:19] 10PAWS: Paws display 502 - Bad gateway error - https://phabricator.wikimedia.org/T140578#2489524 (10Ivanhercaz) @yuvipanda Since yesterday I tried to work in my bot but when I follow the instructions ―clean cookies, logout and sign in again―, it works a few minutes and then the "502 - Bad Gateway error" happens...
[11:49:31] Select the tools project and go to the very bottom
[11:50:03] right. I'll try to remove it and then re-create it, maybe that helps
[11:50:30] mhm, trying to remove it returns "You must be a member of the projectadmin role in project tools to perform this action."
[11:51:58] Yeah, I can't get rid of it
[11:52:02] Nor can you
[11:52:14] You'll need a tools admin
[11:54:27] PROBLEM - SSH on tools-k8s-etcd-02 is CRITICAL: Server answer
[11:59:27] RECOVERY - SSH on tools-k8s-etcd-02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:04:53] tom29739: do you know who would be best to contact regarding this issue?
[12:06:28] Yellowcard, a tools admin
[12:07:03] Yellowcard, it's a Saturday, so I don't think there are many around
[13:38:42] PROBLEM - SSH on tools-grid-master is CRITICAL: Server answer
[13:57:28] 06Labs, 10Tool-Labs: Disable the sumdisc job of AsuraBot - https://phabricator.wikimedia.org/T140909#2489655 (10Luke081515) p:05Triage>03High The Bot is still spamming. Can someone handle that fast please?
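The task above asks for AsuraBot's sumdisc job to be stopped. A hedged sketch of how a running job is normally stopped on the Tools grid (gridengine's qstat/qdel plus the Tools jstop wrapper); the job name and ID are the ones given later in this log at 20:21, and an ordinary maintainer would use `become asurabot` rather than sudo:

```
# Switch to the tool account (admin variant shown; maintainers: `become asurabot`)
sudo -niu tools.asurabot

# See what is running, then stop the offending job by ID or by name
qstat
qdel 8965678        # job ID mentioned at 20:21 below
jstop sum_disc      # equivalent: delete the job by its name
```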
[13:58:41] RECOVERY - SSH on tools-grid-master is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[14:04:40] PROBLEM - SSH on tools-grid-master is CRITICAL: Server answer
[14:14:28] 06Labs, 10Tool-Labs: i18n for https://tools.wmflabs.org/ - https://phabricator.wikimedia.org/T141182#2489670 (10Steinsplitter)
[14:22:53] 06Labs, 10Tool-Labs, 13Patch-For-Review: Convert most top level tool and bastion dns redcords to CNAMEs - https://phabricator.wikimedia.org/T131796#2178815 (10AlexMonk-WMF) > The latter is in a special org that requires commandline access on labcontrol1001 to manipulate. This makes it unusable for tools admi...
[14:26:01] 06Labs, 10Wikimedia-Site-requests, 10wikitech.wikimedia.org, 13Patch-For-Review: Allow wikitech to write files - https://phabricator.wikimedia.org/T126628#2019112 (10AlexMonk-WMF) I understand a bit more about how the file backends work these days, and I think this would make wikitech attempt to connect to...
[14:29:07] 06Labs, 10Tool-Labs, 07I18n: Internationalize Tool Labs' homepage - https://phabricator.wikimedia.org/T105590#2489715 (10Bugreporter)
[14:29:09] 06Labs, 10Tool-Labs: i18n for https://tools.wmflabs.org/ - https://phabricator.wikimedia.org/T141182#2489717 (10Bugreporter)
[14:30:43] 06Labs, 10Tool-Labs, 07I18n: Internationalize Tool Labs' homepage - https://phabricator.wikimedia.org/T105590#2489718 (10Luke081515) p:05Lowest>03Triage
[14:37:11] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Support reverse dns for public labs IPs - https://phabricator.wikimedia.org/T104521#2489734 (10AlexMonk-WMF) a:03AlexMonk-WMF
[15:55:17] 10:39 < doctaxon> Hi! How do I make a cron job quiet enough that there are no cron status lines like this in the .out file:
[15:55:20] 10:39 < doctaxon> [Sat Jul 23 10:20:15 2016] there is a job named 'portal-db' already active
[15:58:16] doctaxon: jsub with -quiet maybe?
[16:03:30] I am already using jsub with -quiet
[16:03:51] jsub with -quiet does not help
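On doctaxon's question: a sketch of the crontab pattern that usually keeps the .out file quiet. The schedule, job name, and script path below are invented for illustration; `-once`, `-quiet`, and `-N` are real jsub options, and the trailing redirect catches whatever jsub still prints at submission time (which appears to be what -quiet alone was not suppressing here):

```
# m   h  dom mon dow   command
*/10  *  *   *   *     jsub -once -quiet -N portal-db $HOME/bin/portal-db.sh >/dev/null 2>&1
```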
[16:33:08] Could someone make grrrit-wm talk again?
[16:34:37] 10Labs-Kubernetes: Odd kubernetes error - https://phabricator.wikimedia.org/T141041#2489800 (10Magnus) I'll have a look at it this weekend, maybe I can code around it in PHP. Then I could continue to use the "default" container.
[16:39:30] Glaisher: it isn't talking?
[16:39:32] Weird
[16:40:56] tom29739: It hasn't shown any change for hours
[16:51:06] I think it needs a reboot
[16:52:31] legoktm: YuviPanda ^
[17:50:36] Glaisher: would kicking it work?
[17:50:54] Don't know much about grrrit-wm
[17:51:01] I don't think so..
[17:51:34] Of course, it could be that there hasn't actually been a gerrit change
[17:51:44] But I find that unlikely
[18:01:37] 10Tool-Labs-tools-Other: https://tools.wmflabs.org/merlbot-web/ 404s - https://phabricator.wikimedia.org/T85739#2489834 (10bd808)
[18:01:39] 10Tool-Labs-tools-Other, 07Tracking: merl tools (tracking) - https://phabricator.wikimedia.org/T69556#2489833 (10bd808)
[19:01:19] !log tools.lolrrit-wm Killed pod grrrit-wm-230500525-ze741
[19:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[19:15:27] bd808: currently here?
[19:15:34] *still here
[19:15:42] Hi Luke081515
[19:15:45] hi :)
[19:16:02] can you tell me which s5... user merlbot is?
[19:16:56] a user from de had an idea we can look into: IIRC merlbot writes its own DBs, and if we can access them, it would be possible to take over merlbot's write jobs, so we could create a temporary workaround
[19:17:18] but for that I need to know how the DBs are organised and whether it would be possible to access them
[19:17:19] Luke081515: Run `id tools.merlbot` on one of the tools bastions
[19:17:38] ah, thx :)
[19:22:08] 10Labs-Kubernetes: Odd kubernetes error - https://phabricator.wikimedia.org/T141041#2490008 (10yuvipanda) @Magnus thank you :D You can also go to https://grafana-labs-admin.wikimedia.org/dashboard/db/kubernetes-pods (login with your wikitech username/password, will be made more open soon), and select your tool n...
[19:23:42] bd808: I got another question :). Are you able to look at s51127__temp_transient? It's the only DB from merlbot I could find on ToolsDB (I will search the replicas too now). Can you take a look at the schema?
[19:25:16] 06Labs, 10Tool-Labs: Disable the sumdisc job of AsuraBot - https://phabricator.wikimedia.org/T140909#2490010 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done
[19:25:23] YuviPanda: thx
[19:27:57] YuviPanda: that labs grafana, I can't log into it
[19:28:12] Nor can I use any LDAP console utilities
[19:28:28] I think it's because I have 2FA enabled
[19:28:35] hi tom29739
[19:28:41] tom29739: do you know how many labsdbs we have, and whether there is another DB besides these and ToolsDB?
[19:28:43] But I need it enabled to use Horizon
[19:28:57] are you using your shell name or wikitech username?
[19:29:22] YuviPanda: it's case sensitive isn't it?
[19:29:36] * tom29739 facepalms
[19:29:49] Luke081515: I think only s2 and s3 are active right now
[19:29:50] the username? yeah
[19:30:34] bd808 c1 and c3 (equivalent to labsdb1001 and 1003). s* are prod slices not entirely relevant to labs anymore
[19:30:48] I wonder, because I just found one database for merlbot's various entries. The problem is that I don't have read access for s51127__temp_transient, so I can't take a look at whether I can get merlbot's data from these tables :-/
[19:31:36] I think without help from merl that reviving that bot is a lost cause :/
[19:32:09] there's no code
[19:32:10] so...
[19:32:57] I'm close to adding per-tool http request counts to graphite :D
[19:33:16] I think programs to decompile things exist? But I'm not good at programming Java...
[19:33:37] working with decompiled java is not fun
[19:33:41] no comments
[19:33:50] You'd need to be a Java expert
[19:33:57] and depending on how it was compiled the symbols may be garbage
[19:34:07] but at least we could maybe find out the hardcoded DB tables
[19:34:20] I *am* a java expert but I wouldn't do it
[19:34:21] maybe...
[19:34:25] hm, ok
[19:34:55] there are 12 tables in that schema. t, t2 and 10 taxa_* tables
[19:35:05] * tom29739 does not decompile programs when he can help it
[19:35:11] * YuviPanda personally recommends writing from scratch with pywikibot and a proper VCS + CI, involving multiple people
[19:35:36] and not depending on array jobs and other OGE features
[19:35:45] * YuviPanda likes how JeanFred and Lokal_Profil are handling the heritage API :)
[19:36:49] bd808: is it possible to get read access to that DB? maybe I can find useful information in there that I can use
[19:37:14] The process for that is to ask permission from the owner...
[19:37:36] since it isn't named *_p
[19:37:53] There might be private data in there
[19:38:06] bd808: but I think WMDE-Fisch was added to the tool too, without asking the owner?
[19:38:44] YuviPanda: that's a good idea
[19:38:50] at least a DESC for each table would be enough for the first moment
[19:38:59] Luke081515: then they are an owner and can approve I guess
[19:39:05] * tom29739 hasn't really looked into testing or CI
[19:39:09] so that I can see how the tables are set up, without seeing the data in there
[19:41:51] bd808: is it possible to look at the output of the DESC command for each table?
[19:42:37] Luke081515: https://phabricator.wikimedia.org/P3564
[19:42:43] thx :)
[20:10:03] bd808: I contacted someone from WMDE now, so they can take a look at whether it seems possible (the user I mailed has access to the tool)
[20:11:19] (just FYI)
[20:20:08] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Thathanka was created, changed by Thathanka link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Thathanka edit summary: Created page with "{{Tools Access Request |Justification=Individual wiki maintenance |Completed=false |User Name=Thathanka }}"
[20:20:31] YuviPanda: Still here?
[20:20:46] bd808: or you?
[20:21:00] what's up Luke081515
[20:21:37] bd808: sry, I forgot to mention it at T140909: the actual running job needs to be killed too. It's sum_disc with ID 8965678, running on tools-exec-1404
[20:21:38] T140909: Disable the sumdisc job of AsuraBot - https://phabricator.wikimedia.org/T140909
[20:23:50] !log tools.asurabot killed running job for sum_disc per T140909
[20:23:51] T140909: Disable the sumdisc job of AsuraBot - https://phabricator.wikimedia.org/T140909
[20:24:28] bd808: thank you very much :)
[20:25:48] yw
[20:28:54] * bd808 -> new star trek movie
[20:51:13] bd808 you watch star trek movies
[20:54:01] 06Labs, 10Tool-Labs: Track web request stats for each tool on tool labs - https://phabricator.wikimedia.org/T69880#2490058 (10yuvipanda)
[20:56:47] 06Labs, 10Tool-Labs, 13Patch-For-Review: Track web request stats for each tool on tool labs - https://phabricator.wikimedia.org/T69880#2490063 (10yuvipanda) a:03yuvipanda
[21:22:28] PROBLEM - SSH on tools-k8s-etcd-02 is CRITICAL: Server answer
[21:23:03] kaldari: YuviPanda: section-redirect doesn't exist.
[21:24:08] ?
[21:24:58] kaldari: whoops. I meant to ping ksft
[21:25:57] yeah, what kind of person starts their name with a K?
[21:26:14] only I get to do that
[21:26:24] Hmm...good question. What person starts their name with a k?
[21:26:34] :p
[21:27:15] ksft: be careful, kaldari is with WMF. He will run you over. :D
[21:27:21] only crazy people who probably spend hours setting up a way to get the program they wrote to maintain an online encyclopedia to run regularly
[21:27:26] RECOVERY - SSH on tools-k8s-etcd-02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[21:28:50] ksft: now he will definitely run you over. :p
[21:29:03] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2490102 (10yuvipanda) Happened to tools-k8s-etcd-02 again.
[21:29:45] I mean, uh, isn't K great?
[21:29:58] lol
[21:31:05] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2490103 (10yuvipanda) ``` [140280.197047] INFO: task sshd:20819 blocked for more than 120 seconds. [140280.198328] Not tainted 4.4.0-1-amd64 #1 [140280.198794] "...
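For anyone following T140256: the pasted trace is the kernel's hung-task watchdog firing. A generic way to check a host for the same symptom (standard dmesg/ps, nothing Tools-specific):

```
# Any tasks the kernel flagged as blocked for more than 120 seconds?
dmesg -T | grep -i 'blocked for more than'

# Processes currently stuck in uninterruptible sleep (D state), the usual
# source of those traces, together with what they are waiting on.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /D/'
```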
[21:31:32] ksft: try logging out and logging back in again.
[21:31:36] okay
[21:31:45] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2490104 (10yuvipanda) It's in labvirt1001 now.
[21:31:56] still not working
[21:32:15] TimStarling: ^
[21:34:20] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2490106 (10yuvipanda) I'm migrating it to labvirt1013 to see if that helps, since this node has died previously too.
[21:36:17] PROBLEM - Host tools-k8s-etcd-02 is DOWN: CRITICAL - Host Unreachable (10.68.18.64)
[21:45:22] RECOVERY - Host tools-k8s-etcd-02 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[21:50:54] 06Labs: Two small instances: for WikiToLearn development - https://phabricator.wikimedia.org/T115282#2490135 (10Toma.luca95) >>! In T115282#2481251, @chasemp wrote: >>>! In T115282#1771743, @Toma.luca95 wrote: >> It is possible have a proxy for *.wikitolearn.org/*.wiki2learn.org domains and subdomains with webso...
[21:52:31] PROBLEM - Puppet staleness on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[21:52:48] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2490138 (10yuvipanda) I see ``` [Sat Jul 23 21:44:05 2016] HTB: quantum of class 10002 is big. Consider r2q change. ``` repeated a bunch of times in dmesg, doesn't...
[21:57:31] RECOVERY - Puppet staleness on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[22:05:24] Change on 12www.mediawiki.org a page Wikimedia Labs was modified, changed by Shirayuki link https://www.mediawiki.org/w/index.php?diff=2199573 edit summary: [+4] translation tweaks
[22:26:42] 10Labs-Kubernetes: Odd kubernetes error - https://phabricator.wikimedia.org/T141041#2490190 (10yuvipanda) (URL at https://grafana-labs-admin.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats now)
[22:49:52] I can't `become` my tool account.
[22:51:58] account names?
[22:52:22] ksft: yellowcard was having that issue earlier
[22:52:36] Krenair: ksft
[22:52:44] the tool is section-redirect
[22:53:06] Wikitech username is KSFT
[22:53:18] seems I'm not much use at the moment, SSH doesn't want to work on my connection
[22:53:36] high packet loss
[22:54:14] Krenair: I reckon it's the same issue as before
[22:54:22] what was the problem?
[22:54:41] I just created the tool, by the way. I probably should have mentioned that before.
[22:54:44] The service group gets created: it appears in https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup
[22:55:01] I created it a little over an hour and a half ago.
[22:55:02] The tool doesn't appear in the big list of tools
[22:55:07] I noticed that.
[22:55:33] And you can't become it, it says the tool does not exist
[22:55:55] Krenair: do you know why a tool would be only partially created?
[22:56:06] to be honest I don't know much about our service group system
[22:56:10] I know there was an LDAP outage last night
[22:56:16] But that's been resolved
[22:56:31] yeah, that probably wouldn't be relevant
[22:57:20] dn: cn=tools.section-redirect,ou=servicegroups,dc=wikimedia,dc=org
[22:57:26] member: uid=ksft,ou=people,dc=wikimedia,dc=org
[23:01:31] rather unhelpfully you can't read sudoers files without being root, and I don't currently have admin in that project
[23:05:24] Krenair: I told yellowcard: it's a Saturday
[23:05:56] There aren't likely to be many around unfortunately
[23:06:38] ksft, it says no such tool?
[23:09:19] ksft, try `sudo -niu tools.section-redirect`
[23:17:20] LDAP has the group and you as a member, `become` won't work because /data/project/section-redirect doesn't exist - this might also be why it doesn't appear on the list
[23:17:33] I'm not yet sure what creates those directories
[23:21:37] !log tools restart maintain-kubeusers on tools-k8s-master-01, was stuck on connecting to seaborgium preventing new tool creation
[23:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[23:22:51] 06Labs, 10Tool-Labs: Add appropriate timeouts to maintain-kubeusers - https://phabricator.wikimedia.org/T141203#2490212 (10yuvipanda)
[23:22:58] krenair ^ was the cause
[23:23:06] ksft Yellowcard your tools should work now
[23:24:36] 06Labs, 10Labs-Infrastructure: Unable to SSH onto tools-login.wmflabs.org - https://phabricator.wikimedia.org/T130446#2490225 (10yuvipanda) 05Open>03Resolved a:03yuvipanda
[23:24:55] * YuviPanda goes afk again
[23:25:15] tom29739 I've renamed the dashboard again, and it has web request stats as well! https://grafana-labs-admin.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats
[23:25:39] I tried it
[23:25:44] It showed nothing
[23:26:12] OK...
[23:26:21] It seems to be working now
[23:26:25] I fixed it up again, been fiddling with it
[23:26:34] Though nothing shows for web requests
[23:26:50] web requests are being recorded only in the last few hours
[23:27:00] so maybe there have been no web requests to that tool in that time?
[23:27:49] maintain-kubeusers is responsible for creating servicegroup dirs under /data/project?
[23:28:16] YuviPanda: I made some web requests
[23:28:19] It works
[23:28:24] I guess it could be worse - I was looking at labstore's create-dbusers :)
[23:28:27] :)
[23:29:33] krenair it used to be worse, there was a bash script running in a loop
[23:29:38] dirpath = os.path.join('/data', 'project', user.name, '.kube')
[23:29:41] os.makedirs(dirpath, mode=0o775, exist_ok=False)
[23:29:42] :/
[23:29:54] God, I bet that was unreliable
[23:29:59] this meant there were three things racing - the bash script (toolswatcher), create-dbusers and maintain-kubeusers
[23:30:02] (the bash script)
[23:30:10] so I folded two of them together
[23:30:35] YuviPanda: I like this one: https://grafana-labs-admin.wikimedia.org/dashboard/db/tools-activity
[23:30:48] krenair you want create_homedir
[23:30:53] Though it doesn't show the web requests
[23:31:07] YuviPanda, lovely
[23:31:22] "N/Areq/min
[23:31:25] "
[23:31:28] tom29739 yeah...
[23:32:12] tom29739 graphite labs is set to 'proxy mode', which I guess causes some of the problems. I'll try to investigate moving it to 'direct' mode next week
[23:32:13] should be less flaky
[23:32:23] krenair yeah... incrementally less shitty tho
[23:32:56] I just went afk for a while. What was the answer?
[23:32:59] It should work now?
[23:33:14] oh, it does
[23:33:45] YuviPanda, so is this now on the list of things that break whenever ldap does?
[23:33:58] unexpected when things work like originally planned \o/
[23:35:10] krenair yeah
[23:35:21] the problem was that it didn't recover when ldap did
[23:36:03] YuviPanda: what's the difference between direct and proxy mode?
[23:36:33] tom29739 direct mode makes requests to graphite-labs.wikimedia.org directly from your browser
[23:36:54] proxy mode sends them to grafana-labs-admin.wikimedia.org, which then makes a request from there to graphite-labs.wikimedia.org
[23:38:16] YuviPanda: why wasn't it in direct mode in the first place?
[23:38:34] because it doesn't work :)
[23:38:38] and I haven't had time to investigate yet
[23:38:56] krenair https://gerrit.wikimedia.org/r/#/c/300741/ should make it better
[23:40:04] alright, now I go for realz
[23:40:07] ttyl!
[23:41:03] * Krenair waves
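A closing note on the direct-versus-proxy point: Grafana's "direct" access mode has the browser query Graphite itself, so graphite-labs would need to answer cross-origin requests from the Grafana origin. A quick, hedged way to check for that from a shell (the Origin value is just the grafana host mentioned above; whether graphite-labs is expected to send CORS headers at all is an assumption):

```
# Does Graphite reply with the Access-Control-* headers a browser-side
# (direct mode) Grafana datasource would need?
curl -sI -H 'Origin: https://grafana-labs-admin.wikimedia.org' \
  'https://graphite-labs.wikimedia.org/render?target=test&format=json' \
  | grep -i '^access-control-'
```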