[00:15:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[00:55:52] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:57:23] PROBLEM - Puppet staleness on tools-worker-1005 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[04:35:03] PROBLEM - Free space - all mounts on tools-worker-1018 is CRITICAL: CRITICAL: tools.tools-worker-1018.diskspace._var_lib_docker.byte_percentfree (No valid datapoints found) tools.tools-worker-1018.diskspace.root.byte_percentfree (<100.00%)
[05:03:51] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Maynich was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=881123 edit summary:
[06:46:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:59:16] RECOVERY - Puppet staleness on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [3600.0]
[07:26:52] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:15:20] !log tools reboot tools-webgrid-generic-1404 as locked up
[12:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[12:19:36] RECOVERY - SSH on tools-webgrid-generic-1404 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[12:40:32] RECOVERY - Puppet staleness on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [3600.0]
[12:40:39] Labs, Operations: cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T132422#2692354 (elukey) Open>Resolved a:elukey
[13:08:52] !log shinken removing labs_debrepo from puppet for shinken-01 because the repo is empty and seemingly unused.
[13:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Shinken/SAL, dummy
[13:09:39] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2692479 (Andrew)
[13:10:39] andrewbogott: we have an issue where a k8s worker 1018 is full up on disk, but I think it's legit and related to paws users
[13:10:44] should be temp I think?
[13:11:04] but also those are hilariously short on disk space when one paws container can take 2+G and they have 18G
[13:11:10] iiuc
[13:11:59] 2g is a lot! I thought that image mirroring and copy-on-write and sigh was supposed to keep things small :(
[13:14:44] this is from docker ps -s which shows size
[13:14:53] but I'm not sure of the internal mechanics of it
[13:15:04] it may be that on disk the base image is 17G already
[13:15:14] and then even the shims being small doesn't help
[13:15:42] when you only have 19G to play w/ (I thought 18 but it's 19)
[13:16:30] yeah, seems like those paws nodes should be much bigger
[13:17:06] also I thought we separated the container images from / onto /srv or something
[13:17:09] but I guess not
[14:07:46] andrewbogott: hello :}
[14:08:03] * andrewbogott waves
[14:08:06] andrewbogott: I have played a bit with ldapsearch last night. But we can't properly filter the search :(
[14:08:18] how so?
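A minimal sketch of the disk check behind the tools-worker-1018 discussion above, using only commands already mentioned in the conversation (docker ps -s for per-container size, plus df and docker images); the /var/lib/docker mount comes from the free-space alert, everything else here is illustrative rather than the actual shinken check:

```bash
#!/bin/bash
# Rough look at where the ~19G on a PAWS worker goes (illustrative sketch).
# Mount points are taken from the "Free space - all mounts" alert above.

# Free space on the Docker data directory and on root.
df -h /var/lib/docker /

# Per-container writable-layer size ("docker ps -s which shows size").
docker ps -s

# The base images themselves; in the discussion above these were ~17G on disk.
docker images
```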
[14:08:40] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2692635 (Andrew)
[14:09:06] so in theory we can grab all hosts having "puppetClass" then have ldapsearch filter out any attributes that have a puppetClass starting with role::
[14:09:12] Would be something like: ldapsearch -LLL -x -b 'ou=hosts,dc=wikimedia,dc=org' '(& (!(dc=*.contintcloud.eqiad.wmflabs)) (&(puppetClass=*)) )' -E 'mv=(puppetClass:caseIgnoreIA5SubstringsMatch:=foo)' puppetClass
[14:09:22] (-E 'mv=...' is to filter values)
[14:09:53] but in our LDAP schema, puppetClass is defined as an IA5Substring, it has an EQUALITY matching rule but no SUBSTR matching rule
[14:09:55] so can't filter :D
[14:10:18] so in short ignore my idea of ldap search :D
[14:10:31] guess some light grep/cut against the raw data is sufficient
[14:11:14] dang
[14:11:34] but yeah, postprocessing seems fine
[14:11:37] possibly one could update the LDAP schema to add the proper SUBSTR
[14:11:47] or even move out of IA5Substring which is apparently deprecated/legacy
[14:11:53] but heck that is a lot of work :}
[14:12:30] at least I have learned a few things about LDAP last night
[14:13:43] probably not a lot of point in fixing our schema when I'm in the midst of trying to remove it :)
[14:14:30] yeah
[14:14:51] do you have some script to easily report instances + roles to be migrated per project?
[14:15:06] nope, I'm just doing it adhoc
[14:15:15] there aren't all that many classes, so it's not a huge deal so far
[14:15:53] great
[14:16:22] chasemp: I did the data mining for switching composer-* jobs to Nodepool. That is ~300 builds to shift to Nodepool which already does 1000 builds per day
[14:16:25] or a 30% increase
[14:16:33] so gotta find a way to split that in smaller chunks probably
[14:16:42] ok
[14:16:58] unless we feel adventurous and just move in one go
[14:17:15] which is easier to handle on my side (avoids having to craft some nasty transient config to support both cases)
[14:18:53] this was all on nodepool previously?
[14:44:22] chasemp: yes sir
[14:45:19] interesting, thanks for working through it
[14:46:05] we had more builds on Nodepool than on permanent slaves
[14:58:00] and I have reworked the board on https://grafana-admin.wikimedia.org/dashboard/db/continuous-integration
[14:58:11] roughly 30% are on Nodepool instances right now
[15:16:42] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:47:43] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[15:51:43] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:27:20] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2693700 (Andrew)
[16:41:29] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2693773 (Andrew)
[16:43:48] ebernhardson: is the rel.search.eqiad.wmflabs instance still in use? (It's fine if it is, I just need to rearrange the puppet config a bit.)
[16:44:21] andrewbogott: yes
[16:44:54] ebernhardson: ok. I'd like to wrap vagrant and vagrant::lxc in a role class for you to use there… can you suggest what that role might be called?
[16:45:50] hmm, i dunno ...
basically what i've done there is set up a non-mediawiki LAMP server that uses vagrant
[16:45:59] so i can have it the same in labs and local dev
[16:46:21] ok
[16:46:35] role::labs::vagrant ?
[16:46:49] oh we had that one before...
[16:47:05] did it get renamed to legacy something though?
[16:47:19] the name seems reasonable to me, dunno about its history though
[16:47:23] Labs, Phabricator, Puppet: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2693827 (demon) a:demon>None
[16:47:53] actually we already seem to have role::labs::vagrant_lxc
[16:48:09] yeah, I just found that :)
[16:48:12] which is ::vagrant + ::vagrant::lxc
[16:48:26] ebernhardson: I'm going to just switch you over to using that role, should be a no-op
[16:48:30] sounds like it's already done :) i suppose this is part of the horizon migration and removing custom puppet config from wikitech?
[16:48:35] andrewbogott: sounds good
[16:48:43] ebernhardson: yes, although several steps removed
[16:50:08] heh. ebernhardson wrote role::labs::vagrant_lxc
[16:50:56] * ebernhardson facepalms
[16:51:07] ok, changed, looks like it was a no-op
[16:51:28] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2693852 (Andrew)
[16:59:33] Labs, Labs-Infrastructure, DBA, Blocked-on-Operations, Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2693879 (chasemp) I feel comfortable that https://gerrit.wikimedia.org/r/#/c/295607/ is a replication of https://github.com/wi...
[17:04:08] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2693916 (Andrew)
[17:44:27] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2694080 (Andrew)
[18:13:48] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2694183 (Andrew)
[18:19:58] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2694202 (Andrew)
[18:20:51] chasemp, which of those changes need to be made before we run the script?
[18:21:38] just the cleaning up of configured views vs. production views?
[18:26:30] Krenair: so in theory if everything matched and there was consistency we could run it now, but I'm not convinced there are not more lurking production drifts that would be clobbered, so I'm not sure; need to think on it and work through a few things. I'm wondering if we don't ride out the current DBs and use this on the new ones only.
[18:26:38] no solid plan yet beyond some testing
[18:26:45] afk for a bit tho :)
[18:26:48] matt_flaschen: May I delete editor-campaigns.editor-engagement.eqiad.wmflabs? It looks like puppet has been broken there for a while.
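For reference, the wrapper role settled on above is described in the conversation as just ::vagrant plus ::vagrant::lxc; the sketch below only captures that shape and is not the actual operations/puppet source:

```puppet
# Sketch of role::labs::vagrant_lxc as described above ("::vagrant + ::vagrant::lxc").
# Only the two includes come from the chat; anything else about the real class
# in operations/puppet is not shown here.
class role::labs::vagrant_lxc {
    include ::vagrant
    include ::vagrant::lxc
}
```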
[18:27:51] Krenair: if we had a new wikidb I would use this against it atm
[18:28:04] we do have one waiting to get views chasemp
[18:28:07] tcywiki
[18:28:12] and wikimania2017wiki
[18:29:14] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2694242 (Andrew)
[18:30:52] godog: can I delete filippo-test-jessie.monitoring.eqiad.wmflabs ?
[18:31:00] andrewbogott, yeah, that should be all right. Nothing in RC, nor have I heard about it. S was probably using it long ago to test Campaigns.
[18:31:10] matt_flaschen: great, thanks
[18:31:43] matt_flaschen: anything else in that project that I can delete while I'm at it?
[18:32:16] flow-tests, docs, ee-flow-extras, ee-flow, mwui — all >2 years old.
[18:33:10] hm, looks like there's been some cleanup there already
[18:34:13] andrewbogott, keep docs please, for mwui ask the UI Standardization team, rest not right now.
[18:34:24] ok, thanks
[18:44:22] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2694316 (Andrew)
[18:48:35] Labs, Phabricator, Puppet: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2694339 (mmodell) a:mmodell
[18:49:05] kart_: I have a question about language-lcmd.language.eqiad.wmflabs if you are around… it uses the puppet class mediawiki::packages (which no longer seems to actually work on that instance.)
[18:49:17] I'm wondering if you still need those packages, or if that instance is defunct, or what...
[18:49:22] Labs, Phabricator, Puppet: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2182293 (mmodell) I'm doing some work on the labs role in https://gerrit.wikimedia.org/r/#/c/313937/
[19:11:44] andrewbogott: any chance you'll get to https://phabricator.wikimedia.org/T147013 in the near future ?
[19:12:43] matanya: shouldn't be too long… we missed our quota review meeting this week due to travel
[19:13:08] andrewbogott: did you enjoy barcelona at least ? :)
[19:13:44] yes!
[19:13:47] Busy but very nice there.
[19:14:41] the food in that Manila place was amazing
[19:23:31] glad to hear you enjoyed it, you deserved some pleasant time
[19:23:59] ops offsite = 5-10 pounds weight gain, lol
[19:24:04] do you know where quarry runs?
[19:24:47] Labs, Operations, Patch-For-Review: Set up monitoring for secondary labstore HA cluster - https://phabricator.wikimedia.org/T144633#2694538 (madhuvishy) a:madhuvishy
[19:26:06] i keep forgetting the right way to build a "watroles" URL
[19:26:08] https://tools.wmflabs.org/watroles/role/ ...
[19:26:19] can't find it or doing it wrong again
[19:26:31] role/role: i think it was
[19:26:52] oh, yes https://tools.wmflabs.org/watroles/role/role::labs::quarry::web got it, nevermind :)
[19:37:20] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:46:47] i assume this is normal
[19:46:49] Notice: /Stage[main]/Base::Labs/User[root]/password: changed password
[19:47:51] !log quarry merged gerrit 308313 - should definitely be no-op, but noticed that puppet is disabled on quarry-main-01
[19:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL, Master
[19:48:07] !log quarry quarry-runner-01 has a problem starting exim4
[19:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL, Master
[20:00:27] mutante: yeah that password thing is normal. It has a bad guard condition or something and applies on every run
[20:00:36] bd808: ok, thanks
[20:09:17] Krenair: is there a task out there somewhere for tcywiki pending?
[20:10:08] chasemp, https://phabricator.wikimedia.org/T142223
[20:10:45] thanks
[20:12:22] Labs: Make user_email_authenticated status visible on labs - https://phabricator.wikimedia.org/T70876#2694857 (AlexMonk-WMF) a:coren>None
[20:15:03] Labs, Labs-Infrastructure: Update /etc/hosts during labs instance first boot - https://phabricator.wikimedia.org/T120830#2694890 (AlexMonk-WMF) a:AlexMonk-WMF>Andrew
[20:18:34] Labs, Labs-Infrastructure, DBA, Blocked-on-Operations: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2694905 (AlexMonk-WMF) a:AlexMonk-WMF>chasemp Chase is working on figuring out what else we need to do before we can run the script. https:/...
[20:39:27] PROBLEM - Puppet run on bdsync-deb is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:49:47] Tool-Labs-tools-Pageviews: Add option show moving average line over chart - https://phabricator.wikimedia.org/T147515#2695043 (MusikAnimal)
[21:51:13] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2695228 (Andrew)
[22:49:26] RECOVERY - Puppet run on bdsync-deb is OK: OK: Less than 1.00% above the threshold [0.0]
[23:00:27] PROBLEM - Puppet run on bdsync-deb is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[23:04:50] * Krinkle is fighting with php/lighttpd in labs to try and find a way to disable gzip and buffering so that early flush in php works as expected.
[23:05:04] So far: .lighttpd.conf: compress.filetype = ()
[23:05:22] public_html/.user.ini: zlib.output_compression = Off
[23:05:28] and in PHP itself, all ob_ reset
[23:05:40] And yet, still being buffered entirely, and still being gzipped
[23:05:40] lighttpd does some fcgi output buffering too. I'm not sure if that's easy to turn off
[23:05:53] https://tools.wmflabs.org/krinkle-redirect/fauxtimeline/?html=chunk
[23:06:10] and you have another layer of nginx to pass through that you can't touch
[23:10:35] nothing obvious in https://redmine.lighttpd.net/projects/1/wiki/docs_modfastcgi other than sendfile support
[23:13:23] deflate.enabled is not supported (unknown module, deflate not loaded)
[23:13:24] compress.allowed-encodings = ()
[23:13:24] compress.filetype = ()
[23:13:37] these are supported, but don't seem to have any effect.
[23:13:45] So either it's still buffering or it's nginx doing it
[23:15:54] * bd808 looks for the nginx config
[23:15:58] bd808: Setting it within PHP seems to work
[23:15:59] header('X-Accel-Buffering: no');
[23:16:12] I'll see if I can set that from lighttpd instead of in my php code
[23:16:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:17:50] Krinkle: this is the nginx config -- https://github.com/wikimedia/operations-puppet/blob/production/modules/dynamicproxy/templates/nginx.conf
[23:18:10] "gzip on;"
[23:18:34] Seems to work now :)
[23:18:35] https://tools.wmflabs.org/krinkle-redirect/fauxtimeline/?html=chunk&
[23:19:29] cool. the other part of the nginx config is -- https://github.com/wikimedia/operations-puppet/blob/production/modules/dynamicproxy/templates/urlproxy.conf
[23:20:39] https://github.com/Krinkle/fauxtimeline/commit/2080bdbcb16
[23:20:54] Yeah
[23:21:00] gzip is still there after this but seems fine
[23:21:07] I think it knows not to hold back too long
[23:28:00] hacky hack hack
[23:31:22] Krinkle: you should write something on wikitech about the adventure or at least the current solution
[23:56:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
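For later readers, a minimal sketch of the early-flush setup Krinkle arrives at above; only the X-Accel-Buffering header and the zlib.output_compression setting are taken from the conversation, the surrounding scaffolding is illustrative rather than the actual fauxtimeline code:

```php
<?php
// Sketch of the early-flush setup discussed above (Tool Labs lighttpd + PHP
// behind the nginx dynamicproxy). Only the header() call and the zlib setting
// are from the conversation; the chunked output below is illustrative.

// Ask the nginx proxy in front of the tool not to buffer this response.
header('X-Accel-Buffering: no');

// Keep PHP itself from gzipping or buffering the output
// (also settable via public_html/.user.ini: zlib.output_compression = Off).
ini_set('zlib.output_compression', 'Off');
while (ob_get_level() > 0) {
    ob_end_flush();
}

// Emit the page in chunks; each flush() should now reach the browser right away.
echo str_pad('<!DOCTYPE html><title>early flush test</title>', 1024), "\n";
flush();

sleep(1);

echo "<p>second chunk, one second later</p>\n";
flush();
```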