[00:17:35] Yippee, build fixed! [00:17:36] Project selenium-Flow » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #208: 09FIXED in 1 min 35 sec: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/208/ [00:17:58] Yippee, build fixed! [00:17:59] Project selenium-Flow » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #208: 09FIXED in 1 min 58 sec: https://integration.wikimedia.org/ci/job/selenium-Flow/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/208/ [00:20:20] Project beta-update-databases-eqiad build #12827: 04STILL FAILING in 19 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12827/ [00:25:38] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2797723 (10Mattflaschen-WMF) All the steps are done, except: * Merge the last two changes above (Parsoid and RESTBase, they are automatically deploye... [00:46:11] twentyafterfour: still here? what's the rationale for https://phabricator.wikimedia.org/D448 ? I'm asking because I see version.py would be committed with 3.3.1 [00:55:51] hrm? version.py shouldn't be in 3.3.1...right? [00:56:42] should be where the tag is: https://github.com/wikimedia/scap/tree/80c0dd09ababa3c3e762868513a0ef1dce6db9f3 [01:00:43] Completely untested, but should be fun :) https://phabricator.wikimedia.org/D454 [01:04:45] neat [01:20:18] Project beta-update-databases-eqiad build #12828: 04STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12828/ [01:21:05] thcipriani: could be an error, I'll flag it [02:20:21] Project beta-update-databases-eqiad build #12829: 04STILL FAILING in 20 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12829/ [02:45:19] godog: the rational is to be able to see which version is installed, especially useful on beta [02:47:25] twentyafterfour: *nod* see also the comments about version.py [02:48:28] godog: I set it up so that the debian packaging scripts will update the version before packaging [02:48:35] based on the newest debian/changelog [02:50:13] I could have it use `git describe` for the version when running it from a dev environment [02:52:01] yup that'd work too, but yeah not sure how feasible it is what I suggested, I don't have a python debian package in mind that does it [02:52:31] godog: I think it's quite feasible [02:53:12] here's how I updated the version in the packaging script: https://phabricator.wikimedia.org/D448#d2b2229a [02:54:16] 10Continuous-Integration-Infrastructure (phase-out-gallium), 10releng-201617-q1, 06Operations, 10ops-eqiad: decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2797918 (10Dzahn) a:05Dzahn>03None [02:55:01] 10Continuous-Integration-Infrastructure (phase-out-gallium), 06Operations, 10ops-eqiad: decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2781777 (10Dzahn) [02:55:18] 10Continuous-Integration-Infrastructure (phase-out-gallium), 06Operations, 10ops-eqiad: decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2781777 (10Dzahn) p:05High>03Normal [02:56:14] twentyafterfour: yeah that has the problem ostriches was mentioning in the comments [02:56:27] anyways we can chat on the review too, I have to run! 
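For readers following the D448 discussion above: the two version sources being weighed are the committed version.py (rewritten by the packaging script from debian/changelog) and `git describe` when running from a checkout. A rough sketch of how the two could be combined, purely illustrative — the paths and variable names here are assumptions, not scap's actual layout or the D448 implementation:

```
# Sketch: prefer git metadata in a dev checkout, fall back to the packaged changelog.
# /srv/scap and SCAP_VERSION are placeholders, not real scap paths/names.
if git -C /srv/scap rev-parse --git-dir >/dev/null 2>&1; then
    SCAP_VERSION="$(git -C /srv/scap describe --tags --always --dirty)"
else
    SCAP_VERSION="$(dpkg-parsechangelog -l /srv/scap/debian/changelog -S Version)"
fi
echo "scap version: ${SCAP_VERSION:-unknown}"
```

This mirrors what twentyafterfour describes: the Debian package gets its version baked in from debian/changelog at build time, while a dev environment can ask git directly.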
[03:20:15] Project beta-update-databases-eqiad build #12830: 04STILL FAILING in 15 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12830/ [04:20:18] Project beta-update-databases-eqiad build #12831: 04STILL FAILING in 18 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12831/ [05:20:16] Project beta-update-databases-eqiad build #12832: 04STILL FAILING in 16 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12832/ [05:44:27] PROBLEM - App Server Main HTTP Response on deployment-mediawiki04 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:48:29] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:49:25] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:56:03] 03Scap3, 10ContentTranslation-CXserver, 10MediaWiki-extensions-ContentTranslation, 06Services, and 3 others: Enable Scap3 config deploys for CXServer - https://phabricator.wikimedia.org/T147634#2798041 (10KartikMistry) @mobrovac Ping! [05:58:53] I'm looking to see what's going on [06:01:31] logstash says permission error [06:20:04] Project beta-update-databases-eqiad build #12833: 04STILL FAILING in 3.9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12833/ [06:20:35] twentyafterfour: around? [06:21:20] PROBLEM - App Server Main HTTP Response on deployment-mediawiki06 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50409 bytes in 1.203 second response time [06:25:40] Amir1: yo [06:25:59] twentyafterfour: It seems the whole beta cluster is down [06:26:04] https://en.wikipedia.beta.wmflabs.org/wiki/Special:Preferences#mw-prefsection-betafeatures [06:26:26] logstash says it gets a permission error [06:26:48] https://logstash-beta.wmflabs.org/ to make image [06:27:15] Amir1: looking into it [06:27:29] Thanks, tell me if I can help [06:27:46] I logged in on deployment-mediawiki06 and see apache is running but .. [06:28:03] can't connect to backend [06:30:09] hhvm crashed [06:30:15] fatal error: stack overflow [06:30:46] !log restarting hhvm on deployment-mediawiki06 [06:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [06:31:39] hhvm[22362]: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff....line 754 [06:32:06] and now it says it is running again (on that instance) [06:32:19] PROBLEM - App Server Main HTTP Response on deployment-mediawiki05 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50624 bytes in 1.503 second response time [06:32:32] okay, so.
Let's restart hhvm on nodes [06:33:26] !log ladsgroup@deployment-mediawiki05:~$ sudo service hhvm restart [06:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [06:33:38] on deployment-mediawiki05 it says the status of hhvm is running and no such error [06:33:45] even though shinken-wm just told us the above [06:33:50] ah [06:33:56] I just restarted it [06:34:02] yep:) [06:34:04] maybe that's it [06:34:15] !log restarting hhvm on deployment-mediawiki04 [06:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [06:34:46] hey it's back [06:34:56] Now it says db is locked [06:35:10] hmm [06:38:32] * twentyafterfour doesn't know the mariadb password [06:41:49] I guess it goes down again [06:41:50] yea, uhm, i also dont really know about the db [06:42:10] stackoverflow error [06:55:01] FSFileBackend::doPrepareInternal: cannot create directory /srv/mediawiki/php-master/images/thumb/9/9d/Commons-... [06:55:38] "Server deployment-db04 (#1) is not replicating?" "All replica DBs lagged. Switch to read-only mode" [06:55:55] I can't figure out how to get into the slave db [06:57:22] ssh to the machine and sudo -i mysql? [06:58:31] oh, nope, apparently that's not set up [07:00:09] lots of confd.service errors [07:06:19] twentyafterfour, need me to take it down and set a root pass? [07:06:47] ah ha! [07:06:51] ? [07:07:05] 161116 5:42:13 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave [07:07:13] [ERROR] Slave SQL: Error 'Can't drop database 'dewiktionary'; database doesn't exist' on query. .. [07:07:25] matt_flaschen, ^ [07:08:17] I think this one was me [07:08:38] per https://phabricator.wikimedia.org/T150764 [07:08:42] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'deployment-db03-bin.000009' position 592967822 [07:09:09] this one wasn't me [07:09:21] so if someone can get into mysql console on deployment-db04 we can restart the slave replicating [07:09:32] mutante: ^ [07:10:33] Krenair: you know how to reset the password? [07:12:19] RECOVERY - App Server Main HTTP Response on deployment-mediawiki05 is OK: HTTP OK: HTTP/1.1 200 OK - 1546 bytes in 0.653 second response time [07:12:59] Amir1: eh, this? [07:13:00] ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2 "No such file or directory") [07:13:10] what do you want me to try [07:13:21] that's on db04? [07:13:25] yes [07:13:29] I just turned it off to have a go at changing the password [07:13:35] ok [07:16:18] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2795757 (10mmodell) So this happened: ``` [ERROR] Slave SQL: Error 'Can't drop database 'dewiktionary'; database doesn't exist' on query. Default da... [07:17:07] huh [07:20:02] Project beta-update-databases-eqiad build #12834: 04STILL FAILING in 1.6 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12834/ [07:24:12] oh, right [07:24:38] Hello guys! Was this released already? 
https://phabricator.wikimedia.org/T150604 [07:26:29] okay, that did the trick [07:27:18] twentyafterfour, you'll find the new mysql root password for -db04 at /tmp/newmysqlpass [07:28:23] PROBLEM - App Server Main HTTP Response on deployment-mediawiki05 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50624 bytes in 1.385 second response time [07:29:28] marostegui, looking at git log at tin.eqiad.wmnet:/srv/deployment/ocg/ocg, no [07:29:51] Krenair: Roger, thank you [07:30:17] the task it's supposed to fix still occurs too [07:30:57] -rw-r--r-- 1 udp2log udp2log 5947 Nov 16 05:25 {channel}.log [07:30:57] wut [07:32:22] how is beta so broken today? [07:34:30] it's here to take shots for prod [07:35:01] okay, I have restarted hhvm on the -mediawiki boxes [07:35:15] they got stuck with 'Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcach...ine 183' [07:36:22] okay, nope, they broke again [07:36:22] RECOVERY - App Server Main HTTP Response on deployment-mediawiki06 is OK: HTTP OK: HTTP/1.1 200 OK - 44273 bytes in 3.200 second response time [07:36:26] aand it's back [07:36:29] thanks Krenair [07:36:50] um, did someone restart again? [07:37:19] maybe I hit a node you already restarted [07:38:06] Nov 16 07:37:46 deployment-mediawiki04 hhvm[16066]: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/rdbms/data...ine 660 [07:38:38] Nov 16 07:37:46 deployment-mediawiki04 hhvm[16066]: [Wed Nov 16 07:37:46 2016] [hphp] [16066:7f077cfff700:7:000001] [] \nFatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/rdbms/database/DatabaseMysqlBase.php on line 660 [07:40:53] and the other is [07:40:55] Nov 16 07:38:18 deployment-mediawiki04 hhvm[16185]: [Wed Nov 16 07:38:18 2016] [hphp] [16185:7f640f3ff700:3:000001] [] \nFatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 [07:41:55] why do I have a funny feeling this might be related to https://gerrit.wikimedia.org/r/#/c/317304/ [07:42:17] PROBLEM - App Server Main HTTP Response on deployment-mediawiki06 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 100697 bytes in 1.430 second response time [07:44:50] well anyway, this stuff is way beyond my area [07:44:54] someone else can deal with it [07:46:09] twentyafterfour [08:18:19] Krenair: ok? [08:20:02] Project beta-update-databases-eqiad build #12835: 04STILL FAILING in 1.7 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12835/ [08:20:47] it looks like you got the replication going ... [08:58:57] looks like deployment-mediawiki instances have bunch of issues [08:59:08] and https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/ broke yesterday bah [08:59:37] !log beta database update broken with: MediaWiki 1.29.0-alpha Updater\n\nYour composer.lock file is up to date with current dependencies! 
[08:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:04:58] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2798189 (10hashar) p:05Triage>03Normal [09:05:28] 10Beta-Cluster-Infrastructure, 06Operations, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2792043 (10hashar) p:05Triage>03Normal [09:05:42] 10Beta-Cluster-Infrastructure, 06Labs, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: rename -labs.php to -beta.php - https://phabricator.wikimedia.org/T150268#2780096 (10hashar) p:05Triage>03Low [09:12:08] 10Beta-Cluster-Infrastructure, 07HHVM: beta cluster app servers no more respond to http request / beta web access is down - https://phabricator.wikimedia.org/T150833#2798202 (10hashar) [09:12:28] !log deployment-mediawiki04 stopping hhv [09:12:30] !log deployment-mediawiki04 stopping hhvm [09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:15:20] hashar: by any chance you know if this will be deployed today? https://phabricator.wikimedia.org/T150604#2797315 [09:18:27] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [09:18:50] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798220 (10hashar) [09:20:02] Project beta-update-databases-eqiad build #12836: 04STILL FAILING in 1.7 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12836/ [09:20:39] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798232 (10hashar) And there is a nice stack overflow: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line... [09:23:35] 10Beta-Cluster-Infrastructure, 07HHVM: beta cluster app servers no more respond to http request / beta web access is down - https://phabricator.wikimedia.org/T150833#2798235 (10hashar) And there is a nice stack overflow: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/Ba... 
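The recovery steps scattered through the morning boil down to: confirm HHVM is wedged, restart it, and log the action to the SAL. A minimal sketch of that loop on one affected instance — the log file path and the curl check are assumptions, only the restart itself is what was actually run:

```
# e.g. on deployment-mediawiki04 (repeat for -mediawiki05 / -mediawiki06 as needed)
sudo service hhvm status                 # often still reports "running" while wedged
sudo tail -n 30 /var/log/hhvm/error.log  # path is an assumption; look for the Stack overflow fatals
sudo service hhvm restart

# sanity check that the box answers again
curl -sI -H 'Host: en.wikipedia.beta.wmflabs.org' http://localhost/wiki/Main_Page | head -n 1

# then: "!log restarting hhvm on deployment-mediawikiNN" in this channel
```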
[09:30:54] PROBLEM - SSH on deployment-mediawiki04 is CRITICAL: Connection refused [09:31:25] PROBLEM - Host deployment-mediawiki04 is DOWN: CRITICAL - Host Unreachable (10.68.19.128) [09:34:25] RECOVERY - Host deployment-mediawiki04 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [09:35:55] RECOVERY - SSH on deployment-mediawiki04 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [09:36:35] 10Beta-Cluster-Infrastructure, 07HHVM: On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 - https://phabricator.wikimedia.org/T150833#2798240 (10hashar) [09:39:20] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 17537 bytes in 4.502 second response time [09:39:20] RECOVERY - App Server Main HTTP Response on deployment-mediawiki04 is OK: HTTP OK: HTTP/1.1 200 OK - 44265 bytes in 2.913 second response time [09:40:24] 10Beta-Cluster-Infrastructure, 07HHVM: On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 - https://phabricator.wikimedia.org/T150833#2798249 (10hashar) On deployment-mediawiki04 under /var/log/hhvm HHVM has a coredump ``` name=/var/lo... [09:43:28] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [09:44:47] 10Beta-Cluster-Infrastructure, 07HHVM: On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 - https://phabricator.wikimedia.org/T150833#2798267 (10hashar) p:05Triage>03Unbreak! [09:45:26] PROBLEM - App Server Main HTTP Response on deployment-mediawiki04 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:26] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:23] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798220 (10mmodell) @hashar: I believe it was broken by {T150764} [09:46:42] marostegui: for OCG, I guess that is the services team deploying it in their own window [09:46:52] marostegui: maybe Marko would know how to push that one [09:47:37] doc being apparently https://wikitech.wikimedia.org/wiki/OCG#Deploying_changes [09:47:56] hashar: Ah ok - thanks! [09:48:17] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798275 (10mmodell) see also {rMWb47ce21cec3a4340dd37c773210a514350f10297} [09:49:48] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798277 (10hashar) I havent tried to find the triggered exception yet. I am looking at T150833 which causes: Fatal error: Stack overflow in /srv/mediawiki/ph... [09:51:13] !log marking deployment-tin offline so I can live hack mediawiki code / scap for T150833 and T15034 [09:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:51:22] 03Scap3, 10Parsoid, 06Services (done), 15User-Joe, 15User-mobrovac: Enable Scap3 config deploys for Parsoid - https://phabricator.wikimedia.org/T144596#2798287 (10mobrovac) 05Open>03Resolved Transition completed, resolving. 
[09:53:54] what a mess [09:55:01] hashar: indeed, if you read scrollback you'll see that several of us worked on it for a long time and it's still broken [09:55:14] oh [09:55:19] I lack the scroll back though : D [09:55:37] did it start happening with the introduction of dewiktionary on beta? [09:56:01] that's what caused beta-update-databases-eqiad to start failing [09:56:15] yeah [09:56:25] and on top of that there is the segfault/Fatal error in objectcache.php [09:56:29] both are probably unrelated [09:56:32] and the stack overflows I believe started with rMWb47ce21cec3a4340dd37c773210a514350f10297 or the related commits [09:56:43] yes the two are unrelated [09:56:59] I am trying to find out when fatal errors started [09:57:02] but... the dewiktionary stuff DID break replication on deployment-db04 [09:57:14] which krenair fixed, apparently [09:57:50] fatals started right around when b47ce21cec3a4340dd37c773210a514350f10297 landed [10:00:33] logstash shows the first stack overflow at 5:16am UTC [10:00:45] and the code got updated at 5:13 with https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/130347/console [10:00:58] that shows the bump of mw core ea42d90..32f3a99 [10:02:59] * 32f3a99 Merge "objectcache: Remove broken cas() method from WinCacheBagOStuff" [10:02:59] |\ [10:02:59] | * d1b53e3 objectcache: Remove broken cas() method from WinCacheBagOStuff [10:03:00] * b47ce21 objectcache: detect default getWithSetCallback() set options [10:04:09] poor Aaron :( [10:04:49] 10Beta-Cluster-Infrastructure, 07HHVM: On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 - https://phabricator.wikimedia.org/T150833#2798347 (10hashar) From logstash, the first stack overflow occurred on 2016-11-16T05:16:16 The Jenkin... [10:05:17] twentyafterfour: yup the objectcache.php change is most probably the reason [10:05:27] I wanted to back up the claim with logs/traces etc and double confirm [10:05:43] + public function declareUsageSectionEnd( $id ) { [10:05:43] + return $this->__call( __FUNCTION__, func_get_args() ); [10:05:44] ... [10:06:26] I am taking a break [10:06:31] will revert both patch on beta cluster [10:06:33] scap it [10:06:38] and see whether that fix the issue [10:06:43] then revert both patches in mediawiki/core [10:06:53] but that is after more coffee / natural call etc :-} [10:26:44] !log Reverting mediawiki/core b47ce21cec3a4340dd37c773210a514350f10297 on beta cluster T150833 [10:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:32:19] RECOVERY - App Server Main HTTP Response on deployment-mediawiki06 is OK: HTTP OK: HTTP/1.1 200 OK - 44264 bytes in 5.213 second response time [10:33:20] deal [10:33:22] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 44706 bytes in 2.460 second response time [10:35:22] RECOVERY - App Server Main HTTP Response on deployment-mediawiki04 is OK: HTTP OK: HTTP/1.1 200 OK - 44274 bytes in 3.499 second response time [10:36:23] 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 - https://phabricator.wikimedia.org/T150833#2798447 (10hashar) I have reverted b47ce21cec3a4340dd37c773210a514350f10297 on t... 
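The narrowing-down above (logstash shows the first fatal at 05:16, the code update landed at 05:13) can be reproduced from any mediawiki/core checkout: list what arrived with that update, using the before/after SHAs printed by the beta-code-update console output. A sketch:

```
# commits pulled in by the 05:13 code update (range taken from the Jenkins console output)
git log --graph --oneline ea42d90..32f3a99

# narrow to the file the fatals point at
git log --oneline ea42d90..32f3a99 -- includes/libs/objectcache/BagOStuff.php
```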
[10:38:19] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798450 (10hashar) I had the commit triggering T150833 reverted on beta cluster but update.php (`mwscript update.php --wiki=enwiki --quick`) still fails. So that is... [10:39:20] !log Removing revert b47ce21cec3a4340dd37c773210a514350f10297 from deployment-tin and reenabling jenkins job. https://gerrit.wikimedia.org/r/321857 will get it fixed [10:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:39:55] Project beta-update-databases-eqiad build #12837: 04STILL FAILING in 1.7 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12837/ [10:40:00] Project beta-code-update-eqiad build #130375: 15ABORTED in 7.4 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/130375/ [10:40:08] Project beta-scap-eqiad build #129111: 15ABORTED in 7.9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/129111/ [10:44:48] Project beta-scap-eqiad build #129112: 04FAILURE in 0.27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/129112/ [10:46:25] twentyafterfour: I am reverting the mwcore patch so it is all good :} thx for the pointer [10:54:47] Project beta-scap-eqiad build #129113: 04STILL FAILING in 0.27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/129113/ [10:58:59] Project beta-scap-eqiad build #129114: 04STILL FAILING in 0.27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/129114/ [11:00:41] OSError: [Errno 17] File exists: '/var/lock/scap' [11:00:42] bahh [11:02:51] Yippee, build fixed! [11:02:51] Project beta-scap-eqiad build #129115: 09FIXED in 1 min 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/129115/ [11:04:44] 10Beta-Cluster-Infrastructure, 07HHVM, 05MW-1.29-release-notes, 13Patch-For-Review, 05WMF-deploy-2016-11-29_(1.29.0-wmf.4): On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 - https://phabricator.wikimedia.org/T150833#2798533 (10h... [11:20:02] Project beta-update-databases-eqiad build #12838: 04STILL FAILING in 1.7 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12838/ [11:30:00] 10Continuous-Integration-Config, 10Wikidata: Run Wikidata browser tests on testwikidata via Jenkins - https://phabricator.wikimedia.org/T105985#2798658 (10Tobi_WMDE_SW) [11:30:02] 10Continuous-Integration-Config, 10Wikidata, 07Browser-Tests, 07Story: [Story] Run browsertests regularly on test.wikidata.org via Jenkins - https://phabricator.wikimedia.org/T101497#2798659 (10Tobi_WMDE_SW) [11:30:05] 10Browser-Tests-Infrastructure, 10Wikidata, 07Tracking: Wikidata Browsertests (tracking) - https://phabricator.wikimedia.org/T88541#2798661 (10Tobi_WMDE_SW) [11:30:08] 10Browser-Tests-Infrastructure, 06Release-Engineering-Team, 07Epic, 05MW-1.28-release-notes, and 3 others: Fix scenarios that fail at en.wikipedia.beta.wmflabs.org or do not run them daily - https://phabricator.wikimedia.org/T94150#2798660 (10Tobi_WMDE_SW) [11:34:29] 10Browser-Tests-Infrastructure, 10Wikidata: mediawiki_api::log_in does not work due to gzip issue - https://phabricator.wikimedia.org/T127309#2798667 (10Tobi_WMDE_SW) 05Open>03Resolved a:03Tobi_WMDE_SW I was not able to reproduce this, so it probably got fixed in the meantime or in a later version of the... 
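The aborted beta-scap-eqiad builds above left the deploy lock behind, hence the `OSError: [Errno 17] File exists: '/var/lock/scap'` on the next runs. A sketch of how one might clear a stale lock by hand on the deployment master, assuming nothing is genuinely mid-deploy (the check below is illustrative):

```
# on deployment-tin: make sure no scap run is actually in flight
pgrep -af scap || echo "no scap processes running"

# if the lock is genuinely stale, remove it and let the Jenkins job retry
sudo rm -rf /var/lock/scap
```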
[11:36:18] 10Browser-Tests-Infrastructure: selenium fails to connect to firefox (headless not sauce) - https://phabricator.wikimedia.org/T117561#2798671 (10Tobi_WMDE_SW) [11:46:58] 03Scap3, 10ContentTranslation-CXserver, 10MediaWiki-extensions-ContentTranslation, 05Language-Engineering October-December 2016, and 4 others: Enable Scap3 config deploys for CXServer - https://phabricator.wikimedia.org/T147634#2798685 (10mobrovac) a:05KartikMistry>03mobrovac The two patches above need... [11:48:46] https://en.wikipedia.beta.wmflabs.org/wiki/Special:Preferences still gives out database locked [11:51:09] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798696 (10hashar) ``` $ strace -f -y -s1024 /usr/local/bin/mwscript update.php --wiki=aawiki --quick execve("/usr/local/bin/mwscript", ["/usr/local/bin/mwscript",... [11:52:13] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798697 (10hashar) p:05Triage>03Unbreak! [11:52:14] :-/ [11:52:28] can't believe I am going to spend my whole day just to unbreak all that mess [12:00:10] :( [12:01:31] :/ [12:04:37] and you know our code is crap [12:04:54] grepping "Set $wgShowExceptionDetails = true; in LocalSettings.php to show detailed debugging information. [12:04:55] " [12:04:58] shows 6 occurrences :( [12:05:03] COPY PASTE IS EVIL [12:07:17] AHHH [12:09:04] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798713 (10hashar) I could not figure out why `$wgShowExceptionDetails` is not true when running update.php so I have just live hacked the related PHP files: * incl... [12:13:54] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798715 (10hashar) On the slave we have Error 'Can't drop database 'dewiktionary'; database doesn't exist' on query. Default database: 'dewiktionary'. Query: '... [12:24:28] !log beta: created dewiktionary table on the Database slave. Restarted replication with START SLAVE; T150834 T150764 [12:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [12:24:56] fixed [12:25:06] Project selenium-RelatedArticles » chrome,beta-mobile,Linux,contintLabsSlave && UbuntuTrusty build #211: 04FAILURE in 24 min: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-mobile,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/211/ [12:25:08] Project selenium-RelatedArticles » chrome,beta-desktop,Linux,contintLabsSlave && UbuntuTrusty build #211: 04FAILURE in 24 min: https://integration.wikimedia.org/ci/job/selenium-RelatedArticles/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta-desktop,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/211/ [12:26:49] Project beta-update-databases-eqiad build #12839: 04STILL FAILING in 17 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12839/ [12:26:52] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798731 (10hashar) 05Open>03Resolved a:03hashar That has been caused by the addition of the German Wiktionary ( T150764 ). The MySQL master/slave replication...
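The fix hashar logs at 12:24 amounts to giving the slave's stuck replicated DROP DATABASE something to drop and then resuming replication. A sketch of the slave-side commands on deployment-db04 — how you authenticate is glossed over here (the credentials situation only gets sorted out later in the day):

```
# on deployment-db04, with a client that can reach the local mysqld
mysql -e "SHOW SLAVE STATUS\G"           # Last_SQL_Error: Can't drop database 'dewiktionary'; database doesn't exist
mysql -e "CREATE DATABASE dewiktionary"  # give the replicated DROP something to drop
mysql -e "START SLAVE"
mysql -e "SHOW SLAVE STATUS\G"           # expect Slave_SQL_Running: Yes
```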
[12:27:28] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2795757 (10hashar) From T150834: That has been caused by the addition of the German Wiktionary ( T150764 ). The MySQL master/slave replication got... [12:27:50] oh my [12:28:26] hashar i belive the problem with $wgShowExceptionDetails = true; is a known problem, i think i found having that problem too [12:28:32] when trying to debug postgres [12:29:34] hashar: thanks! [12:29:57] Sorry it happened. Mostly it was because the creation wasn't completed due to an issue in flow tables [12:30:21] sorry, flow external datasources. You can find it in the patch [12:30:28] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798740 (10hashar) 05Resolved>03Open And the job fails: ``` $ mwscript update.php --wiki=dewiktionary --quick #!/usr/bin/env php MediaWiki 1.29.0-alpha Update... [12:30:38] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2798742 (10hashar) And the job fails: ``` $ mwscript update.php --wiki=dewiktionary --quick #!/usr/bin/env php MediaWiki 1.29.0-alpha Updater Your... [12:34:14] I am just reverting [12:37:25] It's blocked on https://gerrit.wikimedia.org/r/321810 [12:37:38] once this is tested and merged we are good to go [12:46:06] Project selenium-GettingStarted » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #210: 04FAILURE in 24 min: https://integration.wikimedia.org/ci/job/selenium-GettingStarted/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/210/ [12:52:57] https://en.wikipedia.beta.wmflabs.org/w/api.php gateway timeout bah [12:54:24] Yippee, build fixed! [12:54:24] Project beta-update-databases-eqiad build #12840: 09FIXED in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12840/ [12:54:44] 10Beta-Cluster-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Beta update.php broken since Nov 15th 19:20 - https://phabricator.wikimedia.org/T150834#2798806 (10hashar) 05Open>03Resolved 12:54 UTC Project beta-update-databases-eqiad build #12840: FIXED in 1 min 7 sec: https://integration... [12:55:53] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2798810 (10hashar) To unblock T150763, the `dewiktionary` database is no more on the MySQL master and slave. I also removed it from the dblist and wi... [12:56:47] 10Beta-Cluster-Infrastructure: https://en.wikipedia.beta.wmflabs.org/w/api.php 504 Server Error: Gateway Time-out - https://phabricator.wikimedia.org/T150849#2798812 (10hashar) [12:56:54] 10Beta-Cluster-Infrastructure: https://en.wikipedia.beta.wmflabs.org/w/api.php 504 Server Error: Gateway Time-out - https://phabricator.wikimedia.org/T150849#2798824 (10hashar) p:05Triage>03Unbreak! 
[13:02:33] !log Restarted HHVM on deployment-mediawiki05 was not honoring requests T150849 [13:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:02:56] 10Beta-Cluster-Infrastructure: https://en.wikipedia.beta.wmflabs.org/w/api.php 504 Server Error: Gateway Time-out - https://phabricator.wikimedia.org/T150849#2798830 (10hashar) [13:03:20] RECOVERY - App Server Main HTTP Response on deployment-mediawiki05 is OK: HTTP OK: HTTP/1.1 200 OK - 44262 bytes in 3.517 second response time [13:03:30] Yippee, build fixed! [13:03:30] Project selenium-GettingStarted » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #211: 09FIXED in 48 sec: https://integration.wikimedia.org/ci/job/selenium-GettingStarted/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/211/ [13:03:33] 10Beta-Cluster-Infrastructure: https://en.wikipedia.beta.wmflabs.org/w/api.php 504 Server Error: Gateway Time-out - https://phabricator.wikimedia.org/T150849#2798812 (10hashar) 05Open>03Resolved a:03hashar Fixed by restarting HHVM on deployment-mediawiki05. [13:05:58] 10Beta-Cluster-Infrastructure: https://en.wikipedia.beta.wmflabs.org/w/api.php 504 Server Error: Gateway Time-out - https://phabricator.wikimedia.org/T150849#2798835 (10hashar) From IRC logs: ``` lang=irc [06:32:19] PROBLEM - App Server Main HTTP Response on deployment-mediawiki05 is CRITICAL: HTTP... [13:06:08] 10Beta-Cluster-Infrastructure: https://en.wikipedia.beta.wmflabs.org/w/api.php 504 Server Error: Gateway Time-out - https://phabricator.wikimedia.org/T150849#2798840 (10hashar) [13:06:10] 10Beta-Cluster-Infrastructure, 07HHVM, 05MW-1.29-release-notes, 13Patch-For-Review, 05WMF-deploy-2016-11-29_(1.29.0-wmf.4): On beta cluster: Fatal error: Stack overflow in /srv/mediawiki/php-master/includes/libs/objectcache/BagOStuff.php on line 754 - https://phabricator.wikimedia.org/T150833#2798202 (10h... [13:46:47] Yippee, build fixed! 
[13:46:47] Project selenium-VisualEditor » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #214: 09FIXED in 2 min 46 sec: https://integration.wikimedia.org/ci/job/selenium-VisualEditor/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/214/ [14:44:30] Project selenium-Wikibase » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #178: 15ABORTED in 1 hr 9 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/178/ [14:44:31] Project selenium-Wikibase » chrome,test,Linux,contintLabsSlave && UbuntuTrusty build #178: 15ABORTED in 1 hr 9 min: https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/178/ [15:32:16] PROBLEM - Puppet run on deployment-sca01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:34:31] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:35:45] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:00:46] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:07:17] RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:09:29] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:17:54] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review, 07Regression: doc.wikimedia.org displays "403 Forbidden" for coverage sub directories - https://phabricator.wikimedia.org/T150727#2799454 (10hashar) https://gerrit.wikimedia.org/r/321651 makes Apache to honor .htaccess... [16:30:32] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2799514 (10hashar) [16:30:48] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2795939 (10hashar) [16:32:30] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2799534 (10hashar) contint1001 is a rather large machine and I am not aware of what is available in... [16:32:56] thcipriani: good morning. I got the hardware request filled for a contint2001.codfw.wmnet machine ( https://phabricator.wikimedia.org/T150865 ) you are subscribed obviously [16:33:48] hashar: howdy, yup, saw that in email :) [16:34:10] also got the CI staging setup we talked about filed: https://phabricator.wikimedia.org/T150772 [16:34:22] if they get some server available in codfw that match contint1001, that will probably be a fast allocation [16:34:29] neat! 
[16:34:55] and eventually we will want to start puppetizing Jenkins [16:35:21] yeah, puppetizing jenkins plugins will be interesting [16:35:45] I think the CI staging area will let us try out some different ideas there [16:37:59] this, for instance, seems like a very bad idea, https://git.openstack.org/cgit/openstack-infra/puppet-jenkins/tree/manifests/plugin.pp#n67 [16:44:05] PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:02:37] thcipriani: I was with Greg on video [17:02:51] and saying exactly the same: leverage staging to experiment new things / puppetize from scratch :D [17:02:52] blame it all on me [17:02:57] oh right [17:02:58] :) [17:03:03] ahh --no-check-certificate booh [17:03:29] thcipriani: for the plugins, we can probably build them from source and push them to archiva.wikimedia.org [17:03:37] it is used / published to by analytics [17:03:43] or [17:04:09] wrap them in Debian packages. [17:08:02] legoktm: they (cloudbees) emailed me too anyways :) Hopefully my "I'm the manager in charge of Release Engineering and QA" and "no" is sufficient. [17:10:02] Debian Java Team has packaged Jenkins embedded libraries [17:10:07] they are in Jessie [17:10:15] but the stripped Jenkins is not included/got removed ( https://packages.qa.debian.org/j/jenkins.html ) [17:10:18] ... [17:11:08] https://lists.debian.org/debian-release/2015/04/msg00209.html [17:11:18] Jenkins LTS cycle does not match Debian ones. [17:11:21] so makes sense [17:12:09] hi hashar & greg-g :) [17:12:25] hello [17:15:03] oh Luke081515 [17:15:09] :) [17:15:11] hi :) [17:15:34] yep, using a different nick is ometimes a bit confusing ;) [17:16:05] /whois to the rescue [17:16:26] yep, it do it too usually :) [17:16:44] first thing I did :D [17:16:55] anyway I am disappearing / commuting back home [17:17:10] dewiktionary on beta will need to be recreated [17:17:15] though it is blocked on a change in Flow iirc [17:17:19] hm [17:17:36] hashar: I have some ressources for some bouncers left at my server, if you want one ;) [17:18:14] I am on some private channels though :D [17:18:23] and lack of a bouncer is actually a good thing [17:18:29] A) I can pretend I am not aware of something [17:18:45] B) I dont have to spend an hour every morning reading bunch of backlogs and thus start with a white sheet of paper [17:19:02] (eventually I should delete all my pending emails at midnight as well) [17:19:07] to start all fresh and anew every morning! [17:19:14] thx for the offer though :} [17:19:37] :D interesting logic [17:24:04] RECOVERY - Puppet run on integration-slave-precise-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [17:51:12] 64GB RAM for contint2001 ? [17:51:23] would that be enough, heh [17:52:26] i checked the spares list, looks like we have one that is matching contint1001 [17:52:44] but gotta go through the procurement process, comented on ticket [17:54:41] thanks mutante ! :) [18:24:22] PROBLEM - Host integration-puppetmaster is DOWN: CRITICAL - Host Unreachable (10.68.16.42) [18:24:50] twentyafterfour, hasharAway: did you guys end up using mysql root on -db04? [18:30:20] Krenair: I logged in with it but didn't end up doing anything [18:30:29] just verified that replication was running [18:45:38] twentyafterfour, at some point we should take down -db03 and set the password there too [18:57:23] Thanks, Krenair, twentyafterfour. 
I don't know how they created dewiktionary only on the master before, though (and in general how the tables were created). [18:58:29] I'll follow up later today and hopefully addWiki.php just works. [18:58:42] If they created it on the master, it should have replicated to the slave. [19:36:32] greg-g: Any idea why this is not triggering gate and submit: https://gerrit.wikimedia.org/r/#/c/320405/ ? [19:40:10] kaldari: I'm not sure, it's not showing a dependent patch anymore... [19:40:27] should I just submit it? [19:40:56] kaldari you need to re c+2 [19:40:59] ok [19:41:38] Thanks [19:41:48] it says the parent is not current? [19:42:03] greg-g: What does that mean? [19:42:17] no idea! [19:42:48] greg-g patch set 5 shows parent as 44a9c1bf59b110e3d04b6a5083242f981f4de832 [19:43:06] so the change should merge once he removes c+2 and re do c+2 [19:43:31] I search for that change in gerrit and it can't find it [19:43:35] https://gerrit.wikimedia.org/r/#/q/44a9c1bf59b110e3d04b6a5083242f981f4de832 [19:43:56] Oh your correct [19:44:00] I'll rebase and +2 [19:44:03] kaldari try doing rebase [19:44:08] and put in master [19:44:26] "not current" apparently means "doesn't exist" thanks gerrit [19:45:16] there it goes! :) [19:45:22] :) [19:45:25] ah, of course :P [19:51:47] Krenair: I could not found the beta cluster databases credential and ended up with wikiadmin: sql --write aawiki [19:52:01] hasharAway i found the bug https://phabricator.wikimedia.org/T148957 [19:52:08] for why you coulden debug the updater [19:52:09] Krenair: then create database dewiktionary; on slave + SLAVE START [19:52:31] paladox: hello :) [19:52:36] Hi :) [19:52:47] no clue what is up with update.php [19:53:08] wikiadmin could start replication? interesting [19:53:15] Ok [19:53:51] Krenair: I guess perms are wide open [19:55:07] GRANT ALL PRIVILEGES ON *.* TO 'wikiadmin'@'%' IDENTIFIED BY PASSWORD [19:55:08] rre [19:55:09] ffs* [19:57:29] hashar, okay, taking the master down to fix perms [19:59:29] auth will be via unix socket rather than password for root [20:03:00] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2795757 (10greg) >>! In T150764#2800153, @Mattflaschen-WMF wrote: >>>! In T150764#2798735, @hashar wrote: >> That has been caused by the addition of... [20:03:08] So in prod, wikiadmin has these grants: [20:03:37] GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'wikiadmin'@'10.%' IDENTIFIED [20:03:43] GRANT ALL PRIVILEGES ON `%wik%`.* TO 'wikiadmin'@'10.%' [20:03:47] GRANT SELECT ON `heartbeat`.`heartbeat` TO 'wikiadmin'@'10.%' [20:06:27] wikiuser: [20:06:29] GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'wikiuser'@'10.64.%' [20:06:36] GRANT SELECT, INSERT, UPDATE, DELETE ON `%wik%`.* TO 'wikiuser'@'10.64.%' [20:06:40] GRANT SELECT ON `heartbeat`.`heartbeat` TO 'wikiuser'@'10.64.%' [20:06:56] let's replicate those, almost (our network is setup a little differently so obviously not 10.64) [20:07:47] hm, looks like we don't have the heartbeat db set up :/ [20:14:24] Krenair: sync with jaime please [20:14:32] he is working on mysql over socket iirc [20:17:38] does he deal with deployment-prep dbs? [20:21:30] hashar: o/ anything against me increaseing verbosity of mod_rewrite on mediawiki-06 to test a thing? [20:21:47] Krenair: yes [20:22:16] Krenair: DBA are providing assistance guidance for the database. 
Cause that is a narrow field and nobody knows how to get it right beside an actual DBA :] [20:22:39] elukey: do do do :]  dont forget to remove it eventually [20:22:56] elukey: you might want to disable puppet as well or it might well overwrite your hack [20:23:10] elukey: beta being a shared platform... {{be bold}} [20:23:13] hashar, you can tell him [20:25:40] Krenair: I was refering to https://gerrit.wikimedia.org/r/#/c/321878/ done/merged today [20:26:54] marxarelli might know how the permission got setup for the beta cluster database. He did the migration with jaime [20:27:32] hashar: which permissions? [20:27:49] everything was migrated from db1, including the mysql database [20:28:53] !log temporary increasing verbosity of mod_rewrite on deployment-mediawiki06 as test [20:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:29:06] hashar, right, I left a note there that I put it to use [20:30:06] marxarelli, btw, one thing I noticed about that [20:30:28] | user | host | password | [20:30:31] | root | deployment-db1 | | [20:30:59] there is still a user for the old host? [20:32:23] ah, looks like it [20:32:26] should delete that [20:34:15] PROBLEM - Puppet run on deployment-elastic07 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:34:57] PROBLEM - Puppet run on integration-saltmaster is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:35:02] ugh, hang on [20:35:03] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:35:19] I forgot one thing on -db04 [20:35:31] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:35:39] PROBLEM - Puppet run on deployment-ms-be02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:36:34] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2800470 (10RobH) a:03mark contint1001 has Dual Intel® Xeon® Processor E5-2640 v3 (2.6GHz/8c), dua... 
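For reference, the production grants Krenair pasted above (20:03–20:06), adapted for beta, would look roughly like the following. The `10.68.%` source range is an assumption based on the instance IPs seen in this log, the passwords are placeholders, and the heartbeat grant is dropped since beta has no heartbeat database:

```
# run on the beta master (deployment-db03); repeat on -db04 if the grants don't replicate
sudo mysql <<'SQL'
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'wikiadmin'@'10.68.%' IDENTIFIED BY '<redacted>';
GRANT ALL PRIVILEGES ON `%wik%`.* TO 'wikiadmin'@'10.68.%';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'wikiuser'@'10.68.%' IDENTIFIED BY '<redacted>';
GRANT SELECT, INSERT, UPDATE, DELETE ON `%wik%`.* TO 'wikiuser'@'10.68.%';
FLUSH PRIVILEGES;
SQL
```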
[20:36:40] PROBLEM - Puppet run on deployment-memc05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:37:37] okay, now it seems to work properly there too [20:37:42] PROBLEM - Puppet run on integration-slave-trusty-1011 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:38:00] PROBLEM - Puppet run on deployment-eventlogging04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:38:18] PROBLEM - Puppet run on deployment-db04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:38:20] PROBLEM - Puppet run on deployment-elastic05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:38:48] PROBLEM - Puppet run on deployment-restbase01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:38:58] Krenair: just dropped the root@deployment-db1, fyi [20:39:15] the root@deployment-db1 *user* [20:39:27] ty [20:39:30] PROBLEM - Puppet run on deployment-cache-upload04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:40:43] PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:41:28] Project selenium-Echo » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #212: 04FAILURE in 27 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/212/ [20:41:35] Project selenium-Echo » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #212: 04FAILURE in 34 sec: https://integration.wikimedia.org/ci/job/selenium-Echo/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/212/ [20:43:47] PROBLEM - Puppet run on deployment-cache-text04 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:43:51] PROBLEM - Puppet run on deployment-mathoid is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:43:57] PROBLEM - Puppet run on integration-slave-precise-1012 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:44:05] PROBLEM - Puppet run on integration-slave-trusty-1003 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:44:28] PROBLEM - Puppet run on deployment-conf03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [20:45:04] PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:45:23] PROBLEM - Puppet run on deployment-kafka04 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:45:38] PROBLEM - Puppet run on deployment-parsoid09 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:45:51] PROBLEM - Puppet run on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:46:55] PROBLEM - Puppet run on repository is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:47:29] PROBLEM - Puppet run on deployment-zookeeper01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:47:29] PROBLEM - Puppet run on deployment-logstash2 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:47:41] PROBLEM - Puppet run on deployment-redis02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:47:49] PROBLEM - Puppet run on integration-slave-jessie-1002 
is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:48:36] PROBLEM - Puppet run on deployment-elastic06 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:48:46] PROBLEM - Puppet run on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [20:48:48] PROBLEM - Puppet run on deployment-fluorine02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:49:08] PROBLEM - Puppet run on deployment-jobrunner02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:49:10] PROBLEM - Puppet run on deployment-salt02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:49:27] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:49:29] PROBLEM - Puppet run on deployment-urldownloader is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:50:01] PROBLEM - Puppet run on deployment-mira is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:50:05] PROBLEM - Puppet run on deployment-apertium02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:50:13] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:50:23] PROBLEM - Puppet run on deployment-changeprop is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:50:29] PROBLEM - Puppet run on deployment-pdfrender02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:51:46] PROBLEM - Puppet run on deployment-sca03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:52:58] PROBLEM - Puppet run on deployment-prometheus01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:53:02] PROBLEM - Puppet run on castor is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:53:20] !log restored apache2 config on deployment-mediawiki06 [20:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [20:53:31] (results of the experiment - https://phabricator.wikimedia.org/T57857#2800519) [20:53:32] PROBLEM - Puppet run on integration-puppetmaster01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:53:57] hashar, I'm done messing with things in beta for now, let me know if you notice any problems (beyond the labs-puppet-breakage above) [21:07:40] RECOVERY - Puppet run on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:08:56] 03Scap3, 10Parsoid: Env vars being overwritten - https://phabricator.wikimedia.org/T150897#2800601 (10Arlolra) [21:10:31] Krenair: marxarelli: the Jenkins job that runs update.php seems all happy :] ( https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/12849/console ) [21:13:01] RECOVERY - Puppet run on deployment-eventlogging04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:13:15] RECOVERY - Puppet run on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:13:19] RECOVERY - Puppet run on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:13:25] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [21:14:16] RECOVERY - Puppet run on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] 
[21:14:30] RECOVERY - Puppet run on deployment-cache-upload04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:14:56] RECOVERY - Puppet run on integration-saltmaster is OK: OK: Less than 1.00% above the threshold [0.0] [21:15:04] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:15:30] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:15:41] RECOVERY - Puppet run on deployment-ms-be02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:16:43] RECOVERY - Puppet run on deployment-memc05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:17:41] RECOVERY - Puppet run on integration-slave-trusty-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [21:18:48] RECOVERY - Puppet run on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:19:06] RECOVERY - Puppet run on integration-slave-trusty-1003 is OK: OK: Less than 1.00% above the threshold [0.0] [21:20:04] RECOVERY - Puppet run on integration-slave-precise-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [21:20:44] RECOVERY - Puppet run on integration-slave-precise-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [21:21:54] RECOVERY - Puppet run on repository is OK: OK: Less than 1.00% above the threshold [0.0] [21:22:42] 10Beta-Cluster-Infrastructure, 06Operations, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2800637 (10fgiunchedi) @Krenair where are you seeing that btw? The issue afaics is that swift on `deployment-ms-fe01` doesn't have the password for `mw:thumbor` in `/et... [21:23:19] elukey: still around? [21:23:49] RECOVERY - Puppet run on deployment-cache-text04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:23:49] elukey: apache config test used to be a thing. 
We can excavate a bunch of already existing test scripts ;] lets poke each other tomorrow [21:23:53] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0] [21:23:57] RECOVERY - Puppet run on integration-slave-precise-1012 is OK: OK: Less than 1.00% above the threshold [0.0] [21:24:28] RECOVERY - Puppet run on deployment-conf03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:24:30] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:25:14] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:25:20] RECOVERY - Puppet run on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0] [21:25:36] RECOVERY - Puppet run on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [0.0] [21:25:50] RECOVERY - Puppet run on integration-slave-jessie-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [21:26:47] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:27:29] RECOVERY - Puppet run on deployment-zookeeper01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:27:29] RECOVERY - Puppet run on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0] [21:27:51] RECOVERY - Puppet run on integration-slave-jessie-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [21:28:01] commented on task [21:28:01] RECOVERY - Puppet run on castor is OK: OK: Less than 1.00% above the threshold [0.0] [21:28:33] RECOVERY - Puppet run on integration-puppetmaster01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:28:35] RECOVERY - Puppet run on deployment-elastic06 is OK: OK: Less than 1.00% above the threshold [0.0] [21:28:47] RECOVERY - Puppet run on integration-slave-trusty-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [21:29:09] RECOVERY - Puppet run on deployment-jobrunner02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:29:11] RECOVERY - Puppet run on deployment-salt02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:29:30] RECOVERY - Puppet run on deployment-urldownloader is OK: OK: Less than 1.00% above the threshold [0.0] [21:30:00] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0] [21:30:06] RECOVERY - Puppet run on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:30:24] RECOVERY - Puppet run on deployment-changeprop is OK: OK: Less than 1.00% above the threshold [0.0] [21:30:32] RECOVERY - Puppet run on deployment-pdfrender02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:32:58] RECOVERY - Puppet run on deployment-prometheus01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:39:26] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06Operations, 10hardware-requests: codfw: 1 hardware access request for continuous integration - https://phabricator.wikimedia.org/T150865#2800694 (10hashar) @RobH pointed out contint1001 does not use SSD and that might be an IO bottlenec... 
[21:58:29] Project selenium-Core » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #220: 04FAILURE in 6 min 28 sec: https://integration.wikimedia.org/ci/job/selenium-Core/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/220/ [22:05:18] 10Continuous-Integration-Infrastructure, 06Operations, 07Nodepool, 13Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#2800743 (10hashar) Will probably want to cleanup apt.wm.o jessie-wikimedia/backports I will reach out to European ops to... [22:17:04] what's happening on beta? "Sorry! This site is experiencing technical difficulties. Cannot access the database".... [22:20:41] wfm? [22:21:01] Project selenium-CentralAuth » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #213: 04FAILURE in 1 min 0 sec: https://integration.wikimedia.org/ci/job/selenium-CentralAuth/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/213/ [22:22:13] Krenair: i get it when trying to login [22:22:44] huh [22:23:35] oh, right [22:23:36] my bad [22:24:29] etonkovidova, greg-g try now [22:25:12] got further, just logging into LP now :) [22:25:25] yup [22:25:38] Krenair: greg-g All looks normal - thx! [22:25:49] when I was making wikiadmin/wikiuser permissions like production's, I forgot about the centralauth database [22:42:48] I've updated the arcanist installer for windows https://github.com/paladox/Arcanist-installer-for-windows/releases/tag/1.7.0 :) [22:43:06] So easy to install arcanist on windows. [22:45:14] twentyafterfour ^^ :) [22:47:39] (03PS1) 10Gergő Tisza: [EmailAuth] add standard endpoints [integration/config] - 10https://gerrit.wikimedia.org/r/322004 [22:49:42] (03CR) 10Paladox: [C: 031] [EmailAuth] add standard endpoints [integration/config] - 10https://gerrit.wikimedia.org/r/322004 (owner: 10Gergő Tisza) [23:20:21] Project beta-update-databases-eqiad build #12852: 04FAILURE in 20 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/12852/ [23:34:28] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2801053 (10Mattflaschen-WMF) addWiki failed saying the database existed already (?), so now I have to delete it again. [23:39:04] 10Beta-Cluster-Infrastructure, 06WMDE-TLA-Team, 13Patch-For-Review, 15User-Ladsgroup: Make beta German Wiktionary - https://phabricator.wikimedia.org/T150764#2801056 (10Mattflaschen-WMF) I think addWiki.php somehow creates it twice, not sure where. [23:50:05] twentyafterfour, bd808, ostriches, Reedy, any of you available to review a one-liner to fix addWiki.php (and that job)? [23:50:17] sure [23:50:26] matt_flaschen: gladly [23:50:36] twentyafterfour, thanks: https://gerrit.wikimedia.org/r/#/c/322017/ . [23:51:34] lgtm [23:51:36] how many one-liners does it really take to fix addWiki.php [23:52:11] Krenair, hopefully two. I tried to test it locally yesterday, but it just doesn't work. The Vagrant multi-wiki is too different. [23:52:18] I think this is the last one. [23:52:20] hah [23:52:22] yeah [23:52:53] once upon a time I thought that addWiki could be fixed once and for all [23:53:07] experience has shown that every fix to that script is temporary [23:53:08] None of this is specific to dewiktionary or anything new. I guess people keep just doing one-off fixes. 
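Krenair's 22:25 explanation fits the grants sketched earlier: the `%wik%` database pattern never matches `centralauth`, so the shared login database needs its own grants. Roughly, under the same subnet assumption as above:

```
sudo mysql <<'SQL'
GRANT ALL PRIVILEGES ON `centralauth`.* TO 'wikiadmin'@'10.68.%';
GRANT SELECT, INSERT, UPDATE, DELETE ON `centralauth`.* TO 'wikiuser'@'10.68.%';
SQL
```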
[23:53:19] I need to rewrite addWiki [23:53:21] next month a new developer will find a new way to break it [23:53:24] Or not fixing the script at all and just fixing the wiki they're setting up. [23:53:27] $backlog++ [23:54:02] Okay, to be fair, the External Store is new and I broke that (then fixed it). But it didn't even get to there today before breaking. [23:54:06] when it breaks in production (which is almost as common as it running in production), we try to fix it [23:54:23] It breaks every time! [23:54:38] (03PS1) 10Dzahn: delete .htaccess files for doc/integration [integration/docroot] - 10https://gerrit.wikimedia.org/r/322020 (https://phabricator.wikimedia.org/T150727) [23:54:54] "I want to create a wiki" -> runs addwiki and it breaks -> "Shit, lemme fix this." -> creates wiki [23:54:58] ostriches: I find, when you create 4 wikis in a row, it only breaks the on the first one [23:54:59] :P [23:55:02] Go back to the start :p [23:55:25] Reedy: The solution, clearly, is to stop creating wikis! [23:55:40] Dereckson did two the other day, obviously it worked the second time because he live-hacked it to work the first time [23:55:52] They are bad for your health [23:55:58] except the first time was really the first two times because he had to run it, change it to comment the stuff that it had done, then run it again [23:56:56] It's not too bad until you have to keep deleting the whole database in production to re-run it
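For the record, the retry loop being joked about at the end looks roughly like this on beta. The addWiki.php argument order (language code, site/family, dbname, domain), the extension path, and the domain are from memory and illustrative only — check the script's --help before trusting any of it:

```
# on the database master: drop the half-created database; IF EXISTS keeps the
# slave's SQL thread from aborting again if the db never made it there
sudo mysql -e "DROP DATABASE IF EXISTS dewiktionary"

# from a host with mwscript: re-run the creation script
mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki \
    de wiktionary dewiktionary de.wiktionary.beta.wmflabs.org
```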