[00:00:01] (03CR) 10Dzahn: [C: 032] "wmf-style: total violations delta -1" [puppet] - 10https://gerrit.wikimedia.org/r/439554 (owner: 10Dzahn) [00:00:05] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T0000). [00:00:42] PROBLEM - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: 1.728e+05 ge 1.728e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [00:01:54] !log temp. disable puppet on netmon1002,netmon2001 before applying puppet ferm change out of abundance of caution. applied gerrit:439554 without issues on netmon1003 [00:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:52] (03CR) 10Dzahn: [C: 032] "noop on netmon1003 -> netmon1002 -> netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/439554 (owner: 10Dzahn) [00:04:33] !log deploying phabricator release/2018-07-25/1, scheduled downtime for phab1001 in icinga [00:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:24] (03CR) 10Dzahn: "everything merged for now on the topic branch for this: https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+branch:production+top" [puppet] - 10https://gerrit.wikimedia.org/r/383519 (owner: 10Giuseppe Lavagetto) [00:12:53] (03CR) 10Dzahn: "eh yea. 
there's a second topic branch as well and there are still some left to do beyond those: https://gerrit.wikimedia.org/r/#/q/projec" [puppet] - 10https://gerrit.wikimedia.org/r/383519 (owner: 10Giuseppe Lavagetto) [00:14:01] (03PS1) 10Dzahn: mediawiki_maintenance: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447935 [00:16:29] !log phabricator update complete with only momentary downtime [00:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:58] (03PS1) 10Dzahn: maps: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447937 [00:22:00] (03PS1) 10Dzahn: grafana: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447938 [00:22:02] (03PS1) 10Dzahn: deployment_server: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447939 [00:22:04] (03PS1) 10Dzahn: debug_proxy: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447940 [00:22:06] (03PS1) 10Dzahn: osm: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447941 [00:22:08] (03PS1) 10Dzahn: tendril: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447942 [00:23:51] (03CR) 10Dzahn: "is there maybe a ticket about adding avatar support to Gerrit? that would be probably good. 
if there is one please link it" [puppet] - 10https://gerrit.wikimedia.org/r/440104 (owner: 10Paladox) [00:24:27] (03PS6) 10Paladox: Gerrit: Clone avatars repo into /var/www/avatars [puppet] - 10https://gerrit.wikimedia.org/r/440104 (https://phabricator.wikimedia.org/T191183) [00:26:44] (03CR) 10Dzahn: [C: 04-1] "-0.5" [dns] - 10https://gerrit.wikimedia.org/r/441817 (https://phabricator.wikimedia.org/T189637) (owner: 10MarcoAurelio) [00:30:43] (03PS5) 10Dzahn: Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - 10https://gerrit.wikimedia.org/r/441397 (owner: 10Paladox) [00:32:06] paladox: i assume memory limit stuff will always require service restart [00:32:24] possibly yeh [00:37:34] (03CR) 10Dzahn: [C: 031] "since it's just a general performance tuning thing, maybe waiting for a maintenance window with more changes to minimize the times that us" [puppet] - 10https://gerrit.wikimedia.org/r/441397 (owner: 10Paladox) [00:40:21] (03CR) 10Dzahn: "i really don't want to make a decision about killing GWTUI here (/me cries, lol) but if there is a general consensus for this i will deplo" [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [00:41:11] (03CR) 10Dzahn: "needs list mail?" [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [00:42:13] (03CR) 10Paladox: "This dosen't kill GWTUI. Will still be available. Just PolyGerrit will be promoted to default ui." [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [00:43:26] (03CR) 10Dzahn: "is the commit message really correct? says it renames a file from polygerrit-style but adds a link from/to gerrit-theme.html ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [00:43:32] (03CR) 10Paladox: "> needs list mail?" [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [00:45:29] (03PS4) 10Paladox: Link to gerrit-theme.html in scap repo [puppet] - 10https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835) [00:46:43] (03CR) 10Dzahn: "was this in response to an incident or a ticket that made you want to change the number?" [puppet] - 10https://gerrit.wikimedia.org/r/439645 (owner: 10Paladox) [00:48:41] (03PS1) 10Reedy: Disable EducationProgram everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447945 (https://phabricator.wikimedia.org/T125618) [00:48:43] (03PS1) 10Reedy: Undeploy EducationProgram [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447946 (https://phabricator.wikimedia.org/T125618) [00:51:58] (03CR) 10Dzahn: "i dont see it but compiler says syntax error" [puppet] - 10https://gerrit.wikimedia.org/r/439645 (owner: 10Paladox) [00:52:33] (03PS5) 10Paladox: phabricator: Make phd.taskmasters configurable with hiera [puppet] - 10https://gerrit.wikimedia.org/r/439645 [00:55:47] (03CR) 10Dzahn: "yup, you fixed it. 
http://puppet-compiler.wmflabs.org/11868/phab1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/439645 (owner: 10Paladox) [00:58:00] (03CR) 10Dzahn: [C: 032] phabricator: Make phd.taskmasters configurable with hiera [puppet] - 10https://gerrit.wikimedia.org/r/439645 (owner: 10Paladox) [00:58:22] thanks :) [00:59:17] performance tuning tests will be good, thx [00:59:37] re: number of phd task masters.. to optimize phab performance [01:00:46] yep [01:03:19] (03PS9) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) [01:11:27] (03PS1) 10Andrew Bogott: wmcs puppetmaster: allow second region designate to talk to the enc API [puppet] - 10https://gerrit.wikimedia.org/r/447948 [01:12:05] (03CR) 10jerkins-bot: [V: 04-1] wmcs puppetmaster: allow second region designate to talk to the enc API [puppet] - 10https://gerrit.wikimedia.org/r/447948 (owner: 10Andrew Bogott) [01:13:52] (03PS2) 10Andrew Bogott: wmcs puppetmaster: allow second region designate to talk to the enc API [puppet] - 10https://gerrit.wikimedia.org/r/447948 [01:16:58] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10Base) Erm, so I guess goodbye Cat-a-lot and VFC? Why was this decision made despite what @Dereckson wrote above? 
[01:17:35] (03CR) 10Dzahn: "yea, we could probably just set phab1002 temporarily to the failover server in Hiera to get the rsync going" [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [01:17:51] (03PS1) 10Paladox: phabricator: Set phabricator_server_failover to phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/447949 (https://phabricator.wikimedia.org/T190568) [01:18:02] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10Bugreporter) They should be rewrited to add a limit of no more than 90 edits per minute. [01:18:19] (03PS2) 10Paladox: phabricator: Set phabricator_server_failover to phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/447949 (https://phabricator.wikimedia.org/T190568) [01:20:32] (03PS3) 10Andrew Bogott: wmcs puppetmaster: allow second region designate to talk to the enc API [puppet] - 10https://gerrit.wikimedia.org/r/447948 [01:23:37] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10Base) This would kill their purpose, some deal with thousands of pages, rewriting them with such throttle would make them unusable for a... [01:24:08] (03CR) 10Dzahn: [C: 031] "yep, this would be to allow rsyncing data over to the temp server 1002 .. 
instead of the codfw server 2001" [puppet] - 10https://gerrit.wikimedia.org/r/447949 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [01:28:49] (03PS4) 10Andrew Bogott: wmcs puppetmaster: allow second region designate to talk to the enc API [puppet] - 10https://gerrit.wikimedia.org/r/447948 [01:32:14] !log phab1001 - rm /usr/local/sbin/sync-srv-repos that has a reference to non-existing server iridium.eqiad.wmnet (formerly phab) (T190568) [01:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:18] T190568: Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 [01:32:24] (03PS1) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) [01:32:36] * Reedy barfs [01:33:29] (03CR) 10Andrew Bogott: [C: 032] wmcs puppetmaster: allow second region designate to talk to the enc API [puppet] - 10https://gerrit.wikimedia.org/r/447948 (owner: 10Andrew Bogott) [01:33:37] (03CR) 10jerkins-bot: [V: 04-1] Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy) [01:34:16] (03PS2) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) [01:34:46] (03CR) 10Dzahn: [C: 032] "on the passive server rsync gets added to _pull_ from the active server, not the other way around. 
so this allows phab1002 to pull from ph" [puppet] - 10https://gerrit.wikimedia.org/r/447949 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [01:35:02] (03PS3) 10Dzahn: phabricator: Set phabricator_server_failover to phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/447949 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [01:41:37] (03CR) 10Dzahn: [C: 032] "on phab1001 (active server): firewall was adjusted by ferm to let phab1002 in" [puppet] - 10https://gerrit.wikimedia.org/r/447949 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [01:42:20] !log phab1002 - starting rsync of repo data from phab1001 in a screen session after gerrit:447949 T190568 [01:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:24] T190568: Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 [01:43:50] (03CR) 10Dzahn: "probably not needed anymore after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/447949/ and we are syncing now to phab1002 in a s" [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [01:44:02] (03Abandoned) 10Paladox: phabricator: Add new var phabricator_server_new [puppet] - 10https://gerrit.wikimedia.org/r/441384 (https://phabricator.wikimedia.org/T190568) (owner: 10Paladox) [01:46:04] (03PS5) 10Paladox: dumps: add phab1002 as second phab server [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [01:48:10] (03PS6) 10Paladox: dumps: add phab1002 as second phab server [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [01:48:12] (03PS2) 10Paladox: switch phabricator from phab1001 to phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [01:48:40] (03CR) 10Paladox: switch phabricator from phab1001 to 
phab1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [01:51:01] (03CR) 10Dzahn: switch phabricator from phab1001 to phab1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [01:54:50] (03CR) 10Dzahn: [C: 032] mediawiki_maintenance: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447935 (owner: 10Dzahn) [01:54:58] (03PS2) 10Dzahn: mediawiki_maintenance: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447935 [01:55:00] (03CR) 10Paladox: "This broke using phab on the vps" [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [01:55:17] (03PS1) 10Andrew Bogott: wmcs puppet enc api: add ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/447955 [01:57:33] (03CR) 10Dzahn: "how about creating that dummy repo that was mentioned above?" [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [01:59:29] (03CR) 10Dzahn: [C: 032] "noop on mwmaint* and terbium" [puppet] - 10https://gerrit.wikimedia.org/r/447935 (owner: 10Dzahn) [02:01:08] (03PS2) 10Dzahn: maps: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447937 [02:01:12] (03CR) 10Paladox: "Nope, dosen't work." [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [02:03:34] (03CR) 10Dzahn: "maybe then put the whole list of libraries into Hiera?" [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [02:07:06] (03CR) 10Dzahn: "It would just have to be a list of the names "Sprint, security, misc, ava, translations". 
All the rest is identical (${phab_root_dir}/libe" [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [02:08:19] (03CR) 1020after4: [C: 031] "should work now. the submodule checkout (.git bare repo) has to be in deployment/.git/modules/libext/ava instead of directly in libext/a" [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [02:09:00] (03CR) 1020after4: [C: 031] "or what dzahn said - hiera" [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [02:12:15] (03CR) 10Paladox: "> should work now. the submodule checkout (.git bare repo) has to be" [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T199741) (owner: 1020after4) [02:12:20] (03PS1) 10Andrew Bogott: Add ipv6 IPs to labservices boxes [puppet] - 10https://gerrit.wikimedia.org/r/447966 [02:14:22] (03PS2) 10Andrew Bogott: Add ipv6 IPs to labservices boxes [puppet] - 10https://gerrit.wikimedia.org/r/447966 [02:14:24] (03PS2) 10Andrew Bogott: wmcs puppet enc api: add ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/447955 [02:15:22] (03CR) 10Andrew Bogott: [C: 032] Add ipv6 IPs to labservices boxes [puppet] - 10https://gerrit.wikimedia.org/r/447966 (owner: 10Andrew Bogott) [02:24:57] (03PS8) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) [02:24:59] (03PS1) 10Andrew Bogott: Added ipv6 dns for labservices and labtestservices boxes [dns] - 10https://gerrit.wikimedia.org/r/447967 [02:25:10] (03CR) 10jerkins-bot: [V: 04-1] Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) (owner: 10Andrew Bogott) [02:25:13] (03CR) 10jerkins-bot: [V: 04-1] Added ipv6 dns for labservices and labtestservices boxes [dns] 
- 10https://gerrit.wikimedia.org/r/447967 (owner: 10Andrew Bogott) [02:27:41] (03PS2) 10Andrew Bogott: Added ipv6 dns for labservices and labtestservices boxes [dns] - 10https://gerrit.wikimedia.org/r/447967 [02:28:27] (03CR) 10Andrew Bogott: [C: 032] Added ipv6 dns for labservices and labtestservices boxes [dns] - 10https://gerrit.wikimedia.org/r/447967 (owner: 10Andrew Bogott) [02:29:59] Something going on with phab? [02:30:26] I got an intermittant sql error [02:30:31] " Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). [02:30:31] " [02:31:01] twentyafterfour: ^ [02:31:10] Even its error messages have obnoxious snark ;) [02:32:51] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.13) (duration: 13m 40s) [02:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:30] (03PS3) 10Andrew Bogott: wmcs puppet enc api: add ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/447955 [02:36:02] (03CR) 10Andrew Bogott: [C: 032] wmcs puppet enc api: add ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/447955 (owner: 10Andrew Bogott) [02:36:23] PROBLEM - https://phabricator.wikimedia.org on phab1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 4653 bytes in 3.090 second response time [02:37:32] RECOVERY - https://phabricator.wikimedia.org on phab1002 is OK: HTTP OK: HTTP/1.1 200 OK - 35417 bytes in 1.429 second response time [02:38:32] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 15.98 seconds [02:39:21] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 0.62 seconds [02:49:06] Seeing the same phab issue. 
[02:50:17] !log phabricator.wikimedia.org is spewing HTTP 500 ("Can Not Connect to MySQL") [02:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:48] Krinkle: Reedy etc I'm calling mukunda now [02:51:02] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 4653 bytes in 3.110 second response time [02:51:55] he's coming [02:52:12] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 35416 bytes in 0.409 second response time [02:52:23] hmm [02:52:54] everything seems good now. I don't see anything wrong on the phab side at all [02:53:27] http://paste.debian.net/1035194/ [02:53:31] was what I saw ^ [02:55:00] I just saw it. [02:55:11] Works now. [02:55:47] I had an ongoing batch job which got aborted but I never lost connection to phab1001 [02:55:55] maybe a network hiccup? [02:56:06] A Troublesome Encounter! [02:56:06] Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). [02:56:11] the http server was responsive, showing a blue screen with a basic error message ^ [02:56:34] weird how I got a different but similar error [02:56:42] what Krinkle said [02:57:04] i got it too [02:57:18] https://grafana.wikimedia.org/dashboard/db/host-overview?refresh=300s&orgId=1&var-server=phab1002 and https://grafana.wikimedia.org/dashboard/db/host-overview?refresh=300s&orgId=1&var-server=phab1001 [02:57:37] do look interesting. 
something peaked its memory from 9GB to 37 GB in 20min [02:57:55] https://grafana.wikimedia.org/dashboard/db/phabricator?orgId=1&from=now-6h&to=now [02:58:06] Krinkle phab1001 [02:58:06] :) [02:58:15] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104&from=now-24h&to=now [02:58:16] phab1002 is the temp replacement (and is running rsync) [02:58:25] look at the QPS spike on m3 [02:58:43] is that used by phab? [02:58:58] I looked in puppet/hiera but it appears as if phab has a localhost db [02:59:02] but that did seem unlikely indeed [02:59:04] https://wikitech.wikimedia.org/wiki/Phabricator said m3 [02:59:12] k [02:59:16] phab does not use localhost db except on labs [02:59:22] er cloud [02:59:41] "phab1002 - starting rsync of repo data from phab1001 in a screen session after gerrit:447949 T190568" is a recent log message [02:59:42] T190568: Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 [03:00:04] rsync shoulden't affect it [03:00:16] emphasis on 'should' [03:00:58] but yeah, if phab1002 wasn't serving traffic, then it going OOM is probably not related right now. [03:01:01] I think something is DoS'ing m3 [03:01:04] yeah [03:01:20] it was also having replication alerts earlier [03:01:28] anyway, gotta go [03:01:29] o/ [03:01:33] probably max connections on mysql are hit so phab can't get a connection [03:02:02] twentyafterfour: would the upgrade have resulted in a lot of writes to the database? [03:02:06] (or reads) [03:03:04] legoktm: yes I've been running a database migration [03:03:16] it's not terribly intensive but it's going through all commits [03:03:29] Maybe it's not closing its connections and creating a new con for each query? 
[03:03:36] possible [03:03:50] it's also causing replag [03:03:51] * twentyafterfour reads through the code [03:04:48] yeah, a sleep(1) every 1000 affected rows might help, or a more surgical way to wait for just in time replication [03:05:23] I'll stop it for now, it isn't time critical and it's resumable [03:05:35] I can add something to slow it down [03:05:40] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.14) (duration: 15m 13s) [03:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:31] !log stopped phabricator rebuild-identities migration as it's suspected of causing database connection exhaustion and replag [03:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:48] it doesn't look like the migration is creating a bunch of connections but it could be overly taxing the master [03:16:04] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Jul 26 03:16:04 UTC 2018 (duration 10m 24s) [03:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:13] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10bd808) [04:08:26] !log kartik@deploy1001 Started deploy [cxserver/deploy@3c99775]: Update cxserver to e98c81vb (T200283) [04:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:30] T200283: Adaptation fails for inline images without caption - https://phabricator.wikimedia.org/T200283 [04:13:38] !log kartik@deploy1001 Finished deploy [cxserver/deploy@3c99775]: Update cxserver to e98c81vb (T200283) (duration: 05m 12s) [04:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:42] T200283: Adaptation fails for inline images without caption - https://phabricator.wikimedia.org/T200283 [04:53:13] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3317 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447977 [04:55:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447977 (owner: 10Marostegui) [04:57:04] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447977 (owner: 10Marostegui) [04:57:57] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447977 (owner: 10Marostegui) [05:02:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 (duration: 01m 05s) [05:02:39] !log Deploy schema change on db1101:3317 T144010 T51190 T199368 [05:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:46] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:02:47] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:02:47] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:07:21] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0 [05:07:31] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [05:07:32] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [06:33:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447979 [06:35:53] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447979 (owner: 10Marostegui) [06:37:03] (03Merged) 10jenkins-bot: Revert 
"db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447979 (owner: 10Marostegui) [06:38:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447979 (owner: 10Marostegui) [06:38:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 (duration: 00m 55s) [06:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447981 [06:42:17] (03PS8) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [06:44:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447981 (owner: 10Marostegui) [06:45:22] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447981 (owner: 10Marostegui) [06:46:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 (duration: 00m 55s) [06:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:27] !log Deploy schema change on db1086 T144010 T51190 T199368 [06:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:32] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [06:47:33] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [06:47:33] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [06:49:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447981 (owner: 10Marostegui) [06:49:42] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [06:49:50] (03CR) 
10Jcrespo: [C: 04-1] "This is not yet ok, grants are not in sync with the existing ones- see mwmaint2001 missing ones." [puppet] - 10https://gerrit.wikimedia.org/r/431042 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [06:50:18] (03CR) 10Jcrespo: [C: 04-1] "Also, references to wasat, which no longer exists as a domain." [puppet] - 10https://gerrit.wikimedia.org/r/431042 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [06:51:02] (03PS9) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 [06:53:21] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [06:54:01] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:10:17] (03PS1) 10Jcrespo: mariadb: Depool es1015 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447982 [07:12:58] (03PS10) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 (https://phabricator.wikimedia.org/T167790) [07:15:45] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10elukey) [07:16:54] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1015 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447982 (owner: 10Jcrespo) [07:18:03] (03Merged) 10jenkins-bot: mariadb: Depool es1015 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447982 (owner: 10Jcrespo) [07:18:16] (03CR) 10jenkins-bot: mariadb: Depool es1015 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447982 (owner: 10Jcrespo) [07:19:42] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1015 (duration: 00m 55s) [07:19:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:50] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447983 [07:24:17] !log stop and reimage es1015 [07:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:55] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447983 [07:29:16] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10elukey) Adding some comments about all the hosts in which I am listed in: * aqs1008 - would just need to be depooled via pybal before proceeding * dr... [07:29:41] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447983 (owner: 10Marostegui) [07:30:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447983 (owner: 10Marostegui) [07:31:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447983 (owner: 10Marostegui) [07:32:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 (duration: 00m 54s) [07:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447984 [07:39:19] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10elukey) For the mediawiki hosts: mw12[84-90] - APIs Row B mw129[3-6] - videoscalers mw1297 - ?? - I don't see anything related to this host in pupp... 
[07:42:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447984 (owner: 10Marostegui) [07:43:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447984 (owner: 10Marostegui) [07:43:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447984 (owner: 10Marostegui) [07:44:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 (duration: 00m 54s) [07:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:36] (03PS1) 10Jcrespo: mariadb: Repool es1015 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447986 [07:53:19] (03PS11) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 (https://phabricator.wikimedia.org/T167790) [07:57:14] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10Aklapper) >>! In T56515#4452092, @Base wrote: > rewriting them with such throttle would make them unusable for any practical purposes.... 
[07:57:35] !log joal@deploy1001 Started deploy [analytics/refinery@9390b63]: Regular weekly deploy of Analytics-Hadoop scripts - try 2 [07:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:52] 10Operations, 10Traffic: Provide a CI container with pebble - https://phabricator.wikimedia.org/T200405 (10Vgutierrez) p:05Triage>03Normal [07:58:57] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es1015 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447986 (owner: 10Jcrespo) [07:59:45] (03PS12) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 (https://phabricator.wikimedia.org/T167790) [08:00:31] (03Merged) 10jenkins-bot: mariadb: Repool es1015 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447986 (owner: 10Jcrespo) [08:00:41] (03CR) 10jenkins-bot: mariadb: Repool es1015 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447986 (owner: 10Jcrespo) [08:01:12] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10revi) I thought security team and autopatrolled users on Commons has reached a compromise — {T194864} [08:02:50] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:03:30] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:03:50] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:04:30] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:06:06] (03PS13) 10Elukey: [WIP] Refactor Hadoop code to allow more than one cluster in Prod [puppet] - 10https://gerrit.wikimedia.org/r/447813 (https://phabricator.wikimedia.org/T167790) [08:10:35] (03PS3) 10Ema: package_builder: install haveged [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) [08:10:40] (03CR) 10Elukey: "Pcc looks good! https://puppet-compiler.wmflabs.org/compiler02/11879/" [puppet] - 10https://gerrit.wikimedia.org/r/447813 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [08:10:50] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1015 with low load after reimage (duration: 00m 55s) [08:10:51] (03PS2) 10Mobrovac: JobQueue: Signal JobQueueEventBus is never read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594) [08:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:07] (03CR) 10Ema: [C: 032] package_builder: install haveged [puppet] - 10https://gerrit.wikimedia.org/r/447763 (https://phabricator.wikimedia.org/T200307) (owner: 10Ema) [08:13:43] jynus: marostegui: can i take over deploy1001 for 5 minutes to deploy a mw-config change? [08:14:01] mobrovac: fine for me! [08:14:42] (03CR) 10Mobrovac: [C: 032] JobQueue: Signal JobQueueEventBus is never read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594) (owner: 10Mobrovac) [08:15:51] (03Merged) 10jenkins-bot: JobQueue: Signal JobQueueEventBus is never read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594) (owner: 10Mobrovac) [08:15:53] (03CR) 10Gehel: "The build logs for puppet make it really hard to see what is failing. 
In this case, it is a formatting rule:" [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [08:16:05] marostegui: don't trust Marko! :D [08:18:10] (03CR) 10jenkins-bot: JobQueue: Signal JobQueueEventBus is never read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594) (owner: 10Mobrovac) [08:18:25] !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings.php: Set readOnlyReason to false everywhere for JobQueueEventBus - T199594 (duration: 00m 55s) [08:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:29] T199594: Exception "Job queue is read-only" - https://phabricator.wikimedia.org/T199594 [08:18:38] haha [08:18:45] ok i'm done marostegui [08:18:47] thnx [08:18:51] \o/ [08:20:00] !log Deploy schema change on db1090:3317 T144010 T51190 T199368 [08:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:06] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [08:20:06] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [08:20:07] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [08:25:19] (03PS1) 10Jcrespo: mariadb: Repool es1015 fully after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447999 [08:26:38] !log joal@deploy1001 Finished deploy [analytics/refinery@9390b63]: Regular weekly deploy of Analytics-Hadoop scripts - try 2 (duration: 29m 02s) [08:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:40] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:28:10] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
[08:29:51] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:30:30] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:31:25] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) [08:40:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) @ayounsi I am running tcpdump on stat1005 with ' ip6 and src 2620:0:861:108:10:64:53:30` on stat1005 but I see only traffic to... [08:44:30] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 61.03, 38.89, 21.72 [08:46:21] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.30, 37.70, 22.70 [08:47:00] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 62.75, 47.16, 29.11 [08:49:00] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 59.36, 45.20, 28.69 [08:54:28] !log restarting hhvm at mw1230 [08:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:56] !log restarting hhvm at mw1226, mw1276, mw1281 [09:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448002 [09:02:16] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448002 [09:02:40] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 17.43, 27.32, 29.87 [09:03:51] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: 
Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448002 (owner: 10Marostegui) [09:05:07] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448002 (owner: 10Marostegui) [09:05:11] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 12.04, 22.07, 29.24 [09:06:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3317 (duration: 00m 55s) [09:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:50] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448002 (owner: 10Marostegui) [09:09:20] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 6.15, 12.29, 22.80 [09:14:10] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 5.86, 10.15, 22.92 [09:18:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448003 [09:20:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448003 (owner: 10Marostegui) [09:21:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448003 (owner: 10Marostegui) [09:21:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448003 (owner: 10Marostegui) [09:22:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 (duration: 00m 55s) [09:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:53] !log Deploy schema change on db1079 with replication, this will generate lag on labsdb:s7 T144010 T51190 T199368 [09:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:59] T51190: Truncate SHA-1 indexes - 
https://phabricator.wikimedia.org/T51190 [09:22:59] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [09:23:00] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [09:27:48] !log rolling restart of elasticsearch / cirrus / eqiad to disable G1 - T156137 [09:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:52] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [09:34:02] (03PS2) 10Jcrespo: mariadb: Repool es1015 fully after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447999 [09:36:45] looks like the train is no longer blocked, if nobody has complaints, I'll push .14 to group 1 in a few minutes T191060 [09:36:45] T191060: 1.32.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T191060 [09:37:32] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es1015 fully after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447999 (owner: 10Jcrespo) [09:37:47] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448004 [09:39:05] (03Merged) 10jenkins-bot: mariadb: Repool es1015 fully after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447999 (owner: 10Jcrespo) [09:39:19] (03CR) 10jenkins-bot: mariadb: Repool es1015 fully after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447999 (owner: 10Jcrespo) [09:40:31] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1015 fully (duration: 00m 55s) [09:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:39] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448004 [09:43:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448004 (owner: 10Marostegui) 
[09:43:40] 10Operations, 10Analytics, 10Documentation: Remove data from Hadoop's HDFS as part of the user offboard workflow - https://phabricator.wikimedia.org/T200312 (10elukey) [09:44:19] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448004 (owner: 10Marostegui) [09:44:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448004 (owner: 10Marostegui) [09:44:53] (03PS2) 10Ema: vcl: add cluster_{fe,be}_vcl_switch hooks [puppet] - 10https://gerrit.wikimedia.org/r/447836 (https://phabricator.wikimedia.org/T164609) [09:45:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 (duration: 00m 55s) [09:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:44] (03CR) 10Ema: [C: 032] vcl: add cluster_{fe,be}_vcl_switch hooks [puppet] - 10https://gerrit.wikimedia.org/r/447836 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [09:46:54] (03PS1) 10Zfilipin: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448006 [09:46:57] (03CR) 10Zfilipin: [C: 032] group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448006 (owner: 10Zfilipin) [09:48:13] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448006 (owner: 10Zfilipin) [09:49:11] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448006 (owner: 10Zfilipin) [09:49:19] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14 [09:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:13] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.14 (duration: 00m 53s) [09:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:18] 
argh, what's this now: [09:54:24] `Fatal error: Uncaught exception 'ExtensionDependencyError' with message 'PageTriage requires ORES to be installed. ' in /srv/mediawiki/php-1.32.0-wmf.14/includes/registration/ExtensionRegistry.php:279 Stack trace: #0 /srv/mediawiki/php-1.32.0-wmf.14/inclu` [09:54:32] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10TheDJ) I think we are mixing up some things here: 1: @Johan please note that this issue was already resolved quite a while ago (in May),... [09:58:20] similar to T200043 [09:58:21] T200043: PageTriage role on Vagrant should have ORES as dependency - https://phabricator.wikimedia.org/T200043 [09:59:48] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10revi) I was merely informing @base that his concern was (mostly) resolved. I already know the behind the scenes stuff that caused expedi... 
[10:03:17] (03CR) 10Ema: [V: 032 C: 032] "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/447755 (owner: 10Ema) [10:07:50] !log start of ladsgroup@mwmaint1001:~$ foreachwikiindblist s7 populateChangeTagDef.php --sleep 2 (T193873) [10:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:54] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873 [10:08:58] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.32.0-wmf.14" [10:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:56] (03PS1) 10Zfilipin: Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448009 (https://phabricator.wikimedia.org/T191060) [10:10:37] (03CR) 10Zfilipin: [C: 032] Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448009 (https://phabricator.wikimedia.org/T191060) (owner: 10Zfilipin) [10:11:57] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448009 (https://phabricator.wikimedia.org/T191060) (owner: 10Zfilipin) [10:12:13] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448009 (https://phabricator.wikimedia.org/T191060) (owner: 10Zfilipin) [10:15:36] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Patch-For-Review: Jenkins builds using autopkgtest stuck at "Not enough random bytes available" - https://phabricator.wikimedia.org/T200307 (10ema) 05Open>03Resolved a:03ema Installing haveged on the build slaves did the trick, see s... [10:15:57] zeljkof: I know what's going on and I can fix it [10:16:09] do you want me to make the patch and fix and deploy? 
[10:16:12] Amir1: I was just about to ping you [10:16:13] please do [10:16:42] Amir1: yes, please fix it so I can deploy .14 to group 1 as soon as possible [10:16:49] sure [10:17:07] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Patch-For-Review: Jenkins builds using autopkgtest stuck at "Not enough random bytes available" - https://phabricator.wikimedia.org/T200307 (10ema) [10:18:06] (03PS1) 10Urbanecm: Upload HD logos for hdwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448013 (https://phabricator.wikimedia.org/T200296) [10:19:11] (03PS1) 10Urbanecm: Use HD logos in hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448014 (https://phabricator.wikimedia.org/T200296) [10:19:52] (03PS2) 10Urbanecm: Upload HD logos for hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448013 (https://phabricator.wikimedia.org/T200296) [10:23:41] PROBLEM - eventstreams on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 8092: Connection refused [10:23:42] (03PS1) 10Ladsgroup: Enable ORES on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448015 (https://phabricator.wikimedia.org/T200412) [10:24:10] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:24:31] !log cache-text: upgrade to varnish 5.1.3-1wm9 and apply alternate domains patch T164609 [10:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:35] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [10:24:51] RECOVERY - eventstreams on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.108 second response time [10:25:20] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [10:26:49] !log ladsgroup@mwmaint1001:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=test2wiki ores (T200412) [10:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:52] T200412: PageTriage requires ORES to be installed - https://phabricator.wikimedia.org/T200412 [10:27:12] (03CR) 10Ladsgroup: [C: 032] Enable ORES on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448015 (https://phabricator.wikimedia.org/T200412) (owner: 10Ladsgroup) [10:28:27] (03Merged) 10jenkins-bot: Enable ORES on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448015 (https://phabricator.wikimedia.org/T200412) (owner: 10Ladsgroup) [10:28:41] (03CR) 10jenkins-bot: Enable ORES on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448015 (https://phabricator.wikimedia.org/T200412) (owner: 10Ladsgroup) [10:30:57] (03CR) 10Volans: [C: 04-1] "Thanks for tackling this, I've few comments, see inline." 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [10:31:54] it works fine on mwdebug1002 and it doesn't break enwiki [10:31:58] so moving forward [10:33:25] nice :) [10:33:35] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:448015|Enable ORES on test2wiki (T200412)]] (duration: 00m 55s) [10:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:39] T200412: PageTriage requires ORES to be installed - https://phabricator.wikimedia.org/T200412 [10:33:44] (03CR) 10Volans: "Some question/comment inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [10:35:41] zeljkof: https://phabricator.wikimedia.org/T200412#4452746 [10:39:47] Amir1, ehm, SWAT? I'm confused :) [10:40:13] (in comment given for SWAT you wrote SWAT) [10:40:32] Urbanecm: oops my bad [10:40:33] (03PS1) 10Zfilipin: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448018 [10:40:35] (03CR) 10Zfilipin: [C: 032] group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448018 (owner: 10Zfilipin) [10:40:37] I'm sorry [10:41:16] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10TheDJ) @revi got it 👍 [10:41:40] Ok, I scheduled 6 patches for SWAT and I'm waiting for it, so I'm hoping I remembered SWAT time correctly this time :D [10:42:06] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448018 (owner: 10Zfilipin) [10:42:18] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448018 (owner: 10Zfilipin) [10:42:31] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset 
Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi It has been decided at the SRE weekly meeting to leave the deprecation page up indefinitely instead o... [10:43:22] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10fgiunchedi) [10:43:24] 10Operations, 10monitoring, 10Privacy, 10Security-Core: status.wikimedia.org should not load Google Analytics - https://phabricator.wikimedia.org/T115945 (10fgiunchedi) 05Open>03Invalid Parent task resolved! [10:43:37] 10Operations, 10monitoring, 10Privacy, 10Security: status.wikimedia.org should have an alternative privacy policy - https://phabricator.wikimedia.org/T189763 (10fgiunchedi) 05Open>03Invalid Parent task resolved! [10:43:39] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10fgiunchedi) [10:43:54] 10Operations, 10monitoring: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530 (10fgiunchedi) 05Open>03Invalid Parent task resolved! [10:43:56] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10fgiunchedi) [10:44:10] 10Operations, 10Wikidata, 10Wikidata-Campsite, 10Wikimedia-General-or-Unknown, and 5 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520 (10fgiunchedi) [10:44:13] 10Operations: status.wikimedia.org should use some Wikimedia favicon if possible - https://phabricator.wikimedia.org/T134458 (10fgiunchedi) 05Open>03Invalid Parent task resolved! 
[10:44:17] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10fgiunchedi) [10:45:29] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14 [10:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:23] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.14 (duration: 00m 54s) [10:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:42] Amir1: it's deployed, I'm staring at the logs [10:46:58] Cool [10:50:06] (03PS1) 10Vgutierrez: Provide a valid pebble config for the integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/448020 (https://phabricator.wikimedia.org/T200405) [10:50:44] Amir1: looks good! I'll resolve the task [10:50:49] Amir1: thanks! [10:51:40] :) [10:52:54] (03PS2) 10Vgutierrez: Provide a valid pebble config for the integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/448020 (https://phabricator.wikimedia.org/T200405) [10:57:00] (03PS3) 10Vgutierrez: Provide a valid pebble config for the integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/448020 (https://phabricator.wikimedia.org/T200405) [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1100). [11:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:10] Here :) [11:01:29] !log ladsgroup@mwmaint1001:~$ mwscript sql.php --wiki=labtestwiki /srv/mediawiki-staging/php-1.32.0-wmf.14/maintenance/archives/patch-change_tag_def.sql [11:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:51] I can SWAT today [11:02:10] Good! Ping me when needed :) [11:02:26] !log ladsgroup@mwmaint1001:~$ foreachwikiindblist wikitech populateChangeTagDef.php (T193873) [11:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:30] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873 [11:02:34] Urbanecm: will do, as soon as the first patch is at mwdeub [11:02:36] debug [11:08:07] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [11:08:12] As always, nothing has changed :) [11:08:50] Urbanecm: what do you mean? can you test 440092? 
[11:08:54] (03PS1) 10Gergő Tisza: Revert "wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 [11:09:11] I mean, that you will ping me when patch is on mwdebug, as you always did :D [11:09:13] Nothing important [11:09:26] Urbanecm: ah :) yes, the standard procedure [11:09:33] (03PS2) 10Gergő Tisza: Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) [11:09:36] (03CR) 10Zfilipin: Don't assign sysop-level privileges to bureaucrats explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [11:09:44] (03PS6) 10Zfilipin: Don't assign sysop-level privileges to bureaucrats explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [11:09:55] zeljkof, I can test only partially (on checkuserwiki), not on other private wikis [11:09:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [11:11:12] (03Merged) 10jenkins-bot: Don't assign sysop-level privileges to bureaucrats explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [11:11:29] (03CR) 10jenkins-bot: Don't assign sysop-level privileges to bureaucrats explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440092 (https://phabricator.wikimedia.org/T197095) (owner: 10Urbanecm) [11:12:19] Urbanecm: 440092 is at mwdebug [11:12:23] ack [11:12:33] zeljkof: qq: is wmf.14 being rolled out today everywhere? 
[11:12:45] (03PS3) 10Gergő Tisza: Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) [11:13:08] mobrovac: if there are no blockers, yes, today during train window [11:13:15] kk cool [11:13:16] thnx [11:13:17] and there are no blockers so far [11:13:18] zeljkof, please deploy everywhere [11:13:24] Urbanecm: will do [11:13:29] Thanks [11:13:34] (03PS4) 10Gergő Tisza: Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) [11:14:14] (03PS3) 10Zfilipin: Fix unexpected space in pswiki's ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480) (owner: 10Urbanecm) [11:14:49] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:440092|Dont assign sysop-level privileges to bureaucrats explicitly (T197095)]] (duration: 00m 56s) [11:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:53] T197095: Don't assign sysop-level privileges to bureaucrats explicitly - https://phabricator.wikimedia.org/T197095 [11:15:02] Urbanecm: 440092 is deployed [11:15:05] ack [11:16:17] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480) (owner: 10Urbanecm) [11:17:52] (03Merged) 10jenkins-bot: Fix unexpected space in pswiki's ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480) (owner: 10Urbanecm) [11:18:58] Urbanecm: 445541 at mwdebug [11:19:02] ack [11:19:48] zeljkof, working, please deploy [11:19:53] Urbanecm: ok [11:20:19] (03PS9) 10Zfilipin: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) 
(owner: 10Urbanecm) [11:20:51] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:445541|Fix unexpected space in pswikis ExtraNamespaces definition (T199480)]] (duration: 00m 55s) [11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:55] T199480: Unexpected space in ExtraNamespaces definitions for pswiki - https://phabricator.wikimedia.org/T199480 [11:21:01] Urbanecm: 445541 deployed [11:21:06] ack [11:21:07] (03CR) 10jenkins-bot: Fix unexpected space in pswiki's ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480) (owner: 10Urbanecm) [11:23:13] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [11:24:08] (03PS4) 10Urbanecm: Support zh-my localized URLs in Wikipedia virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/443215 (https://phabricator.wikimedia.org/T198371) [11:24:24] (03Merged) 10jenkins-bot: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [11:25:51] (03CR) 10jenkins-bot: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [11:26:00] godog, moritzm, _joe_: Hi, how should I get https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443215/ deployed? Should I add this to Puppet SWAT? Or wait? Something else? 
[11:26:28] Urbanecm: 441839 at mwdebug [11:26:33] ok [11:26:56] Urbanecm: I think puppet swat is a good way forward for that [11:27:13] but then, I've never participated at puppet swat :) [11:27:30] zeljkof, I thought it as well, but "Changes to the HHVM or Apache configuration for the MediaWiki application server cluster are not eligible for SWAT, due to the potentially far reaching impact / unavailability" is in https://wikitech.wikimedia.org/wiki/PuppetSWAT [11:27:57] So I'm wondering if I understood this wrongly or this indeed cannot be in puppet SWAT [11:28:26] zeljkof, please deploy 441839 [11:28:31] hm, well, I don't really know [11:28:35] Urbanecm: ok, deploying [11:29:16] Yup, me too, that's why I asked "puppet SWATters". It can be both :) [11:30:02] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:441839|Create TemplateEditor group on enwikivoyage (T198056)]] (duration: 00m 55s) [11:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:06] T198056: Template editor group required at Wikivoyage - https://phabricator.wikimedia.org/T198056 [11:30:12] Urbanecm: 441839 deployed [11:30:15] ack [11:30:24] (03CR) 10Reedy: [C: 04-1] "You'll need to copy this into environments/mediawiki_test/modules/mediawiki_exp/files/apache/sites too" [puppet] - 10https://gerrit.wikimedia.org/r/443215 (https://phabricator.wikimedia.org/T198371) (owner: 10Urbanecm) [11:30:39] Urbanecm: so 445537 can not be tested? [11:30:49] 10Operations, 10Core-Platform-Team, 10WMF-JobQueue, 10MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), and 3 others: Exception "Job queue is read-only" - https://phabricator.wikimedia.org/T199594 (10mobrovac) a:03mobrovac These errors should disappear shortly once `wmf.14` is deployed. Keep... 
[11:30:53] (03PS10) 10Zfilipin: Test spaces in ExtraNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: 10Framawiki) [11:30:56] zeljkof, no [11:31:02] (no, it cannot be tested) [11:32:28] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: 10Framawiki) [11:34:00] (03Merged) 10jenkins-bot: Test spaces in ExtraNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: 10Framawiki) [11:36:01] 10Operations, 10Wikimedia-Apache-configuration, 10Chinese-Sites, 10Patch-For-Review, 10User-Urbanecm: All "zh-my" variant page views get 404 Not Found on zh.wikipedia.org - https://phabricator.wikimedia.org/T198371 (10Urbanecm) Well, all puppet patches I ever pushed to Gerrit were deployed by @Dzahn afte... [11:37:11] !log zfilipin@deploy1001 Synchronized tests/InitialiseSettingsTest.php: SWAT: [[gerrit:445537|Test spaces in ExtraNamespaces (T199162)]] (duration: 00m 55s) [11:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:15] T199162: Test spaces in ExtraNamespaces - https://phabricator.wikimedia.org/T199162 [11:37:24] Urbanecm: 445537 deployed [11:37:42] ack [11:38:12] (03CR) 10jenkins-bot: Test spaces in ExtraNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: 10Framawiki) [11:40:05] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448013 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:41:15] Reedy, can you help me how should I "copy this into environments/mediawiki_test/modules/mediawiki_exp/files/apache/sites too" in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443215/? 
[11:41:25] (03Merged) 10jenkins-bot: Upload HD logos for hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448013 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:41:25] Literally as it says [11:41:28] Look at the other path [11:41:34] Copy the change into the same file in that path [11:42:42] There is no Wikipedia vhost in the other file [11:42:57] There are only includes [11:43:04] (03CR) 10jenkins-bot: Upload HD logos for hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448013 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:43:20] Urbanecm: 448013 at mwdebug [11:43:37] There might be something relevant in wikipedia.conf being linked by main.conf in the other path, but I cannot find where it is [11:43:41] thank you zeljkof [11:44:08] zeljkof, please deploy 448013 [11:44:23] environments/mediawiki_test/modules/mediawiki_exp/templates/apache/sites/included/wikipedia.org.conf.erb [11:45:29] !log zfilipin@deploy1001 Synchronized static/images/project-logos: SWAT: [[gerrit:448013|Upload HD logos for hewikiquote (T200296)]] (duration: 00m 54s) [11:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:32] T200296: SVG logo update for Wikiquote - https://phabricator.wikimedia.org/T200296 [11:46:03] (03PS5) 10Urbanecm: Support zh-my localized URLs in Wikipedia virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/443215 (https://phabricator.wikimedia.org/T198371) [11:46:15] So it should be like I updated it ^^? 
[11:46:32] Urbanecm: 448013 deployed [11:46:36] Thank you zeljkof [11:46:52] (03CR) 10Reedy: [C: 031] Support zh-my localized URLs in Wikipedia virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/443215 (https://phabricator.wikimedia.org/T198371) (owner: 10Urbanecm) [11:46:58] (OT: I'd like to know why there should be two files that are same) [11:47:28] (03PS2) 10Zfilipin: Use HD logos in hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448014 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:47:40] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448014 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:47:43] Something to do with testin infrastructure [11:47:50] mwdebug* etc [11:47:59] And part of a rewrite/refactoring I think [11:48:27] Cannot we just use symlinks? [11:48:43] Anyway, thank you for your review. What I should do with it next? [11:48:50] All the files aren't the same... So no [11:49:13] (03Merged) 10jenkins-bot: Use HD logos in hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448014 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:49:28] (03CR) 10jenkins-bot: Use HD logos in hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448014 (https://phabricator.wikimedia.org/T200296) (owner: 10Urbanecm) [11:49:42] Ok [11:49:54] I need to poke someone again to do the wikimania vhost. Can probably get yours done too [11:50:06] That would be great! [11:50:41] Urbanecm: 448014 at mwdebug [11:50:46] ack [11:52:19] zeljkof, can you purge https://en.wikipedia.org/static/images/project-logos/hewikiquote-2x.png and https://en.wikipedia.org/static/images/project-logos/hewikiquote-1.5x.png please? Those URLs are 404 without mwdebug on my side [11:52:39] Urbanecm: sure, did not know we need to purge new files.. 
[11:54:14] Urbanecm: generally apache patches aren't suitable for puppet swat, though it seems simple enough [11:54:28] Urbanecm: purged [11:54:53] heh [11:55:14] * Reedy tags godog as it [11:55:19] Thank you zeljkof, please deploy 448014 [11:55:30] * godog looks at the cookie he just licked [11:55:46] Urbanecm: add it today's puppet SWAT, I'll merge it [11:55:51] godog: There's also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/446786/ if you wouldn't mind [11:55:55] godog, okay, will do [11:56:01] elukey was going to do it earlier in the week, but I forgot to chase him [11:56:26] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:448014|Use HD logos in hewikiquote (T200296)]] (duration: 00m 52s) [11:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:30] T200296: Add HD logos for hewikiquote - https://phabricator.wikimedia.org/T200296 [11:56:43] Reedy: ack, I'll ask elukey to take a look too [11:56:44] Urbanecm: 448014 deployed [11:56:57] Urbanecm: please check and thanks for deploying with #releng ;) [11:57:02] Thank you zeljkof [11:57:04] !log EU SWAT finished [11:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:27] Urbanecm: Depending on when they get merged, I'll see if I can find a wiki create window [11:57:42] Thank you Reedy. [11:58:36] I'm wondering if https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/446786/ should be added to puppet SWAT as well... 
[11:59:22] Urbanecm: I just asked that, scroll up :D [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1200) [12:00:33] Then I don't get the answer :D [12:01:27] yes please add that to puppet swat as well :) [12:06:12] * addshore investigates odd wikidata dispatching behaviour, which seems to have a bunch of locks for no reason [12:13:23] zeljkof: 09:50 zfilipin@deploy1001: Synchronized php: group1 wikis to 1.32.0-wmf.14 (duration: 00m 53s) [12:13:26] that included wikidatawiki right? [12:14:15] Would look like it [12:14:19] it's def .14 [12:14:27] can we roll that back? [12:14:35] just for wikidata? [12:14:37] I can't think of anything else that is causing these dispatching issues [12:15:11] wait [12:15:18] let me double check timestamps [12:15:40] I'm looking for something around 11:00 UTC today [12:16:23] The only other stuff is Amir1 doing stuff to do with change tags on mwmaint1001 [12:16:36] 10:46 zfilipin@deploy1001: Synchronized php: group1 wikis to 1.32.0-wmf.14 (duration: 00m 54s) [12:16:46] addshore: go ahead [12:16:56] addshore: yup, I'm running some maint script atm [12:16:59] one on wikidata [12:17:00] my plan is to move .14 to everywhere in an hour [12:17:13] Amir1: but that shouldn't be causing dispatching locks to be broken? :P [12:17:27] Amir1: https://phabricator.wikimedia.org/T200420 [12:17:41] addshore: unless it eats up all the machine's resources [12:17:41] addshore, Amir1: should I stop the train? [12:18:18] let's investigate a little bit further [12:18:40] what happened at 11:00 UTC today?
[12:18:40] addshore, Amir1: let me know if T200420 is blocking the train [12:18:40] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [12:18:53] zeljkof: it might require wikidatawiki to be rolled back [12:19:00] (03CR) 10Filippo Giunchedi: [C: 031] Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) (owner: 10Gergő Tisza) [12:19:30] ops: Is anything wrong with redis nodes? [12:19:40] addshore: I don't have experience with rolling back just one wiki, does that stop the train? [12:19:51] zeljkof: I don't know, I'm sure Reedy does :D [12:25:45] (03PS4) 10Vgutierrez: Provide a valid pebble config for the integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/448020 (https://phabricator.wikimedia.org/T200405) [12:26:06] Amir1: see anything? [12:26:25] addshore: Can't find anything anywhere [12:26:29] it seems pretty random [12:26:46] the only thing it lines up remotely with is the .14 rollout [12:27:37] yeah [12:27:40] :/ [12:28:11] we didn't touch dispatching code in the past couple of weeks [12:28:13] then let's roll back wikidatawiki and see if anything changes [12:28:15] Amir1: indeed [12:28:21] maybe the lockmanager in core has changed? [12:28:25] *looks* [12:29:01] https://github.com/wikimedia/mediawiki/commits/130ec2523df12a3ca2fe0d422163696d09fcea08/includes/libs/lockmanager [12:29:13] not really [12:32:44] Amir1: roll back? [12:32:53] I'd go with it. [12:33:06] I see nothing odd in any logs etc either, or logstash for mwmaint, mmhmhmm [12:33:29] perhaps now is the time we write some easy code to execute to remove all locks? :P [12:33:46] zeljkof: Reedy which one of you can roll back wikidatawiki then? [12:33:59] addshore: can't you?
[12:34:11] * addshore never has / doesn't know what is needed to do it [12:35:04] addshore, Reedy: I'm on lunch break, back for train window [12:35:05] (03PS1) 10Reedy: Rollback wikidatawiki to .13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448032 [12:35:28] (03CR) 10Reedy: [C: 032] Rollback wikidatawiki to .13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448032 (owner: 10Reedy) [12:36:37] Reedy: please tag T200420 in the log message [12:36:38] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [12:36:55] (03Merged) 10jenkins-bot: Rollback wikidatawiki to .13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448032 (owner: 10Reedy) [12:38:03] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: wikidatawiki back to .13 T200420 [12:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:28] (03CR) 10jenkins-bot: Rollback wikidatawiki to .13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448032 (owner: 10Reedy) [12:38:43] thanks, we have to wait for the locks to expire before we see anything anyway [12:40:46] (03PS1) 10Addshore: Wikidata dispatch, select 20 wikis instead of 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448034 (https://phabricator.wikimedia.org/T200420) [12:41:07] (03CR) 10Addshore: [C: 032] Wikidata dispatch, select 20 wikis instead of 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448034 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [12:42:22] (03Merged) 10jenkins-bot: Wikidata dispatch, select 20 wikis instead of 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448034 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [12:42:51] (03CR) 10jenkins-bot: Wikidata dispatch, select 20 wikis instead of 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448034 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [12:44:01] !log addshore@deploy1001 Synchronized 
wmf-config/Wikibase.php: T200420 - Wikidata, dispatch, select 20 instead of 15 wikis (duration: 00m 55s) [12:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:05] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [12:45:54] so, even when selecting 20 instead of 15, so other wikis are selected, they are all still locked [12:46:48] Amir1: I'll turn verbose logging on and see if it shows us anything too? [12:47:24] yeah but is the locks got expired? [12:47:39] the locks will still be there for another 20 mins I think [12:47:43] 10Operations, 10User-notice: 2018 data center switchover: Move all the things over to codfw - https://phabricator.wikimedia.org/T200022 (10Trizek-WMF) When it it? [12:48:06] I have only done T178652 for testwikidatawiki so far, was going to do wikidatawiki next week [12:48:07] T178652: Wikidata dispatchers should use a LockManager with a short TTL - https://phabricator.wikimedia.org/T178652 [12:59:41] so, some locks expired and some dispatching happened again, i guess the others will expire over the next 15 mins, then we can see if they get released or not [13:00:04] zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1300). [13:01:25] My gut says something odd is happening with redis for our locking, but I have no idea how to check that really [13:08:32] Amir1: what ops are normally around at this time? 
I think it might be worth getting someone to look at redis [13:10:18] (03PS1) 10Addshore: Verbose logging for wikidata dispatching [puppet] - 10https://gerrit.wikimedia.org/r/448037 (https://phabricator.wikimedia.org/T200420) [13:10:30] addshore: channel topic says robh should be around [13:10:38] zeljkof: robh is currently marked as away [13:11:05] Getting that verbose logging turned on would be great ^^ [13:11:27] * addshore should file a ticket to enable us to turn that on from mediawiki-config as with the other dispatching config vars [13:12:18] addshore: I am around [13:12:22] what can I help with? [13:12:38] could you please merge the puppet patch above? :) turning on verbose logging for the wikidata dispatcher [13:12:57] and force a puppet run on mwmaint1001, to get it on the cron asap [13:13:07] addshore: akosiaris and _joe_ afaik [13:13:19] sure [13:13:21] (03CR) 10Alexandros Kosiaris: [C: 032] Verbose logging for wikidata dispatching [puppet] - 10https://gerrit.wikimedia.org/r/448037 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [13:13:53] I'm at lunch break (Today is ORES day anyway :D) [13:14:08] Amir1: ack :) I have this anyway, I'll keep digging [13:14:19] addshore, Amir1: is T191060 blocked on T200420? Should I wait with train, or continue with it? [13:14:20] T191060: 1.32.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T191060 [13:14:20] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [13:15:13] addshore: done [13:15:15] zeljkof: personally I don't think this ticket should block the train, we rolled back wikidatawiki to .13 already and it didn't solve the problem [13:15:37] akosiaris: thanks, I'll wait for the next cron to start and I should start seeing some extra output [13:15:59] * addshore isn't even sure if the output will help, but right now I want to see everything [13:16:01] addshore: so, I should push .14 everywhere?
[13:16:15] I mean, the usual train schedule [13:16:39] it might be worth waiting until we find the cause of what is happening [13:17:33] addshore: ok, I'll add T200420 as a blocker for now, when you have more data feel free to remove it from blockers, if it's unrelated, sounds good? [13:17:45] sounds good [13:18:02] addshore: deal, train is blocked for now, let me know when you have more info [13:28:00] akosiaris: any idea how I can see the locks that are in redis? :/ [13:29:56] hm, I have to research that [13:31:48] or any other logs about redis potentially doing odd things, but I guess they would all be in logstash [13:32:11] yes any mediawiki logs about that would be in logstash [13:35:31] godog, Reedy: o/ - yes I remember the patch, I'll re-take a look and +1 it in a bit. [13:36:04] godog: for context, we postponed it last friday since this change modifies the redirect of http://wikimania.wikimedia.org/ [13:36:14] that atm goes to wikimania2018.wikimedia.org [13:36:32] but in theory, after the patch, should not be needed anymore for the future [13:36:43] (but I didn't want to break anybody during wikimania) [13:39:39] Right, Amir1 I think I'm just going to switch over to a lock manager with a shorter time out now.. 
[13:39:50] (03PS3) 10Addshore: Use new wikibase dispatch lock manager on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395969 (https://phabricator.wikimedia.org/T178652) [13:40:53] (03CR) 10Addshore: [C: 032] Use new wikibase dispatch lock manager on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395969 (https://phabricator.wikimedia.org/T178652) (owner: 10Addshore) [13:42:03] (03Merged) 10jenkins-bot: Use new wikibase dispatch lock manager on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395969 (https://phabricator.wikimedia.org/T178652) (owner: 10Addshore) [13:42:20] (03CR) 10jenkins-bot: Use new wikibase dispatch lock manager on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395969 (https://phabricator.wikimedia.org/T178652) (owner: 10Addshore) [13:43:50] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) nevermind, I saw it, it doesn't happen very often though. Will try to figure out its origin. [13:44:00] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use new wikibase dispatch lock manager on wikidatawiki T200420 T178652 (duration: 00m 55s) [13:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:05] T178652: Wikidata dispatchers should use a LockManager with a short TTL - https://phabricator.wikimedia.org/T178652 [13:44:06] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [13:44:27] Right, once the new crons start now the recovery from stuck locks should be lots quicker... 
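As a rough illustration of why a shorter lock TTL shortens recovery from stuck locks, here is a minimal Python sketch. It is not the MediaWiki LockManager or its Redis backend: `TtlLockTable`, `seconds_until_recovery`, and the key name `dispatch:examplewiki` are hypothetical names invented for this example, and the clock is simulated rather than real. It only shows that a lock whose holder never releases it stays blocked until the TTL runs out.

```python
import time
from typing import Callable, Dict


class TtlLockTable:
    """In-memory stand-in for a lock store whose entries expire after a TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expires_at: Dict[str, float] = {}

    def acquire(self, key: str) -> bool:
        """Take the lock unless an unexpired entry already exists."""
        now = self.clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # still held (or stuck) and not yet expired
        self._expires_at[key] = now + self.ttl_seconds
        return True

    def release(self, key: str) -> bool:
        """Drop the lock; report whether an entry was present."""
        return self._expires_at.pop(key, None) is not None


def seconds_until_recovery(ttl_seconds: int, step: int = 60) -> int:
    """Simulate a holder that never releases and count the wait for a retry."""
    now = [0.0]  # simulated clock, advanced manually below
    table = TtlLockTable(ttl_seconds, clock=lambda: now[0])
    table.acquire("dispatch:examplewiki")  # first holder takes the lock, then dies
    waited = 0
    while not table.acquire("dispatch:examplewiki"):
        waited += step
        now[0] = float(waited)
    return waited


if __name__ == "__main__":
    print(seconds_until_recovery(15 * 60))  # short TTL
    print(seconds_until_recovery(60 * 60))  # long TTL
```

In this toy model the retry succeeds only once the simulated clock reaches the TTL, so the wait scales directly with the TTL; it says nothing about why the locks get stuck in the first place.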
[13:47:01] * akosiaris watching the log [13:47:16] * addshore looks at the code again [13:52:10] (03Abandoned) 10Dzahn: mariadb: remove grants for terbium (do not merge) [puppet] - 10https://gerrit.wikimedia.org/r/431042 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [13:52:36] right, it looks like some dispatching is happening again now, but let's see if all of the locks actually recover or they remain locked again [13:54:08] i wish the log lines also output the timestamp of the start time of the script they come from :P [13:57:03] looks like LockManager has a logger too, but it is never used anywhere.. I imagine the output of that could help... heh [14:08:34] well, dispatching seems to be flowing a little better now [14:08:35] (03CR) 10Elukey: "Can we break down the change in two? First we go with mediawiki_test, that will be applied (safely) only to mwdebug hosts. After testing, " [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) (owner: 10Reedy) [14:09:07] * addshore waits to see if it locks up again after a few more mins [14:12:18] (03PS1) 10Ema: vcl: use synth() instead of error in the code shutting users out [puppet] - 10https://gerrit.wikimedia.org/r/448041 (https://phabricator.wikimedia.org/T129424) [14:12:41] (03CR) 10Imarlier: [C: 031] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) (owner: 10Gergő Tisza) [14:12:42] and it all got locked up again....... [14:13:04] at least now it should attempt to recover every 15 mins instead of every 1 hour [14:15:00] (03CR) 10Ema: [C: 032] vcl: use synth() instead of error in the code shutting users out [puppet] - 10https://gerrit.wikimedia.org/r/448041 (https://phabricator.wikimedia.org/T129424) (owner: 10Ema) [14:16:13] Right, the LockManager is failing to acquire locks, but it also apparently can't remove them...
[14:16:41] trying to lock a key I get "lockmanager-fail-acquirelock" and trying to remove it I get lockmanager-notlocked [14:21:35] (03PS5) 10MSantos: Set up cron task to regen low-zoom vector tiles [puppet] - 10https://gerrit.wikimedia.org/r/447851 (https://phabricator.wikimedia.org/T194787) [14:21:43] 10Operations, 10Traffic, 10Continuous-Integration-Config, 10Patch-For-Review: Provide a CI container with pebble - https://phabricator.wikimedia.org/T200405 (10Legoktm) a:03Vgutierrez [14:23:28] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Get helm test to dump more information - https://phabricator.wikimedia.org/T200348 (10akosiaris) FWIW and strictly speaking, helm is NOT deploying the service-checker pod. That would be `tiller`, the in cluster //server// part of helm. An... [14:24:40] !log sudo gnt-instance modify --disk add:size=150G webperf2002 T199853 [14:24:43] !log sudo gnt-instance modify --disk add:size=150G webperf1002 T199853 [14:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:45] T199853: Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti) - https://phabricator.wikimedia.org/T199853 [14:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:40] (03PS1) 10Addshore: Add a logger to wikibaseDispatchRedisLockManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448042 (https://phabricator.wikimedia.org/T200420) [14:27:01] (03PS1) 10Addshore: Add wikibaseDispatchRedisLockManager to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448043 [14:27:14] (03PS2) 10Addshore: Add wikibaseDispatchRedisLockManager to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448043 (https://phabricator.wikimedia.org/T200420) [14:28:17] (03PS1) 10Alexandros Kosiaris: Bump Deployment apiVersion to apps/v1beta2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/448045 [14:28:38] (03CR) 10Addshore: [C: 032] 
Add a logger to wikibaseDispatchRedisLockManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448042 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [14:29:40] 10Operations, 10vm-requests, 10Performance-Team (Radar): Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (10herron) a:05herron>03akosiaris [14:30:09] (03PS1) 10Volans: Initial structure [software/spicerack] - 10https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079) [14:30:11] (03PS1) 10Volans: Add common base utility modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) [14:31:25] (03PS2) 10Addshore: Add a logger to wikibaseDispatchRedisLockManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448042 (https://phabricator.wikimedia.org/T200420) [14:31:30] (03CR) 10Addshore: [C: 032] Add a logger to wikibaseDispatchRedisLockManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448042 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [14:31:35] (03PS3) 10Addshore: Add wikibaseDispatchRedisLockManager to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448043 (https://phabricator.wikimedia.org/T200420) [14:33:02] (03Merged) 10jenkins-bot: Add a logger to wikibaseDispatchRedisLockManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448042 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [14:33:28] (03CR) 10jenkins-bot: Add a logger to wikibaseDispatchRedisLockManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448042 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [14:33:40] addshore: please do let me know if there are any progress with T200420, it would be great if we could move the train forward today [14:33:41] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [14:33:59] no progress yet :/ [14:34:05] just 
about to get some more logs to look at [14:34:30] addshore: no pressure, but I have to ask :) [14:34:37] !log addshore@deploy1001 sync-file aborted: Add a logger to wikibaseDispatchRedisLockManager T200420 (duration: 00m 02s) [14:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:52] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: Add a logger to wikibaseDispatchRedisLockManager T200420 (duration: 00m 56s) [14:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:06] (03CR) 10Addshore: [C: 032] Add wikibaseDispatchRedisLockManager to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448043 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [14:37:13] (03Merged) 10jenkins-bot: Add wikibaseDispatchRedisLockManager to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448043 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [14:38:52] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add wikibaseDispatchRedisLockManager to wmgMonologChannels T200420 (duration: 00m 54s) [14:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:55] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [14:39:00] (03CR) 10jenkins-bot: Add wikibaseDispatchRedisLockManager to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448043 (https://phabricator.wikimedia.org/T200420) (owner: 10Addshore) [14:40:35] 10Operations, 10User-notice: 2018 data center switchover: Move all the things over to codfw - https://phabricator.wikimedia.org/T200022 (10Jdforrester-WMF) >>! In T200022#4453157, @Trizek-WMF wrote: > When it it? According to the parent task, T199676: > The time frame for the switchover is: > > * Switching f... 
[14:40:59] * addshore waits for the next dispatcher to start to see stuff in wikibaseDispatchRedisLockManager [14:41:26] (03PS3) 10Dzahn: maps: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447937 [14:43:21] (03PS1) 10Urbanecm: Introduce autopatrolled on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448050 (https://phabricator.wikimedia.org/T199475) [14:47:11] AaronSchulz: around at all? [14:48:45] (03PS1) 10Gehel: check_legal_html: update check to reflect new URL [puppet] - 10https://gerrit.wikimedia.org/r/448051 [14:52:11] addshore: hmm? [14:52:12] zeljkof: as antoine is out can I ask you for a change to zuul/layout.yaml? [14:52:36] (03Abandoned) 10Gehel: check_legal_html: update check to reflect new URL [puppet] - 10https://gerrit.wikimedia.org/r/448051 (owner: 10Gehel) [14:52:47] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5c9bf79]: Expose the mobile-html end point - T199491 [14:52:50] AaronSchulz: I'm trying to figure out what is up with wikidata dispatching, and now I'm just staring at RedisLockManager and QuorunLockManager trying to see if either of them are some how the cause [14:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:51] T199491: Expose /page/content-html endpoint via RESTBase - https://phabricator.wikimedia.org/T199491 [14:53:07] If I do the following: [14:53:10] https://www.irccloud.com/pastebin/ffOvXFMK/ [14:53:51] volans: sure, I'll do my best [14:53:51] I get the following: [14:53:54] https://www.irccloud.com/pastebin/EePUUtMq/ [14:54:16] zeljkof: should be straightforward: https://gerrit.wikimedia.org/r/#/c/integration/config/+/448049/ [14:54:18] I don't appear to be able to acquire the lock or remove the lock... [14:54:19] thanks a lot! 
[14:54:52] nothing seems to have changed or got deployed when this started happening, so I'm just looking all around the place [14:54:56] (03CR) 10Dzahn: [C: 032] maps: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447937 (owner: 10Dzahn) [14:55:38] I switched dispatching to use the new wikibaseDispatchRedisLockManager that I had tested on testwikidatawiki, but the behaviour stayed the same, at least now the locks time out after 15 mins instead of 1 hour [14:55:45] volans: ok, just to make it clear, I should merge and deploy it? [14:56:18] And I added a logger to wikibaseDispatchRedisLockManager but it doesn't seem to actually log anything :( sad times [14:56:39] zeljkof: if possible yes, thanks. I'd like to add tox CI for that repo and I don't have +2 on integrations/config ;) [14:57:18] volans: no problem, the commit looks fine, I'll merge and deploy. I'll let you know when it's done, test it and I can revert in case of problems [14:57:39] great! thanks a lot! [14:57:58] I have also 2 pending CRs on that repo, so we can test with a recheck that it runs as expected [14:59:05] found them, I'll recheck when it's deployed [14:59:10] (03PS2) 10Dzahn: grafana: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447938 [15:00:33] (03CR) 10Jforrester: "Tut, this is meant to be three distinct patches to avoid deployment sync whoopsies. ;-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447946 (https://phabricator.wikimedia.org/T125618) (owner: 10Reedy) [15:02:55] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @Deskana @dr0ptp4kt I've been pointed to the two of you as people who have acces...
[15:03:09] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) a:05Imarlier>03None [15:03:48] any ideas AaronSchulz ? :( [15:03:56] (03CR) 10Zfilipin: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:04:04] (03CR) 10Zfilipin: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:04:19] volans: ^ [15:04:25] yeah, following ;) [15:04:53] volans: the jobs are running... https://integration.wikimedia.org/zuul/ [15:05:52] volans: and both are green :) [15:05:56] all looks good! Thanks a lot! [15:06:08] * volans hopes to not have bypassed some procedure [15:07:20] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/948246b0-90e5-11e8-8c17-41ee0e0d81cc [15:08:20] addshore: is it just that key? 
[15:08:30] no, it is all of the dispatching keys [15:08:49] after the ttl expires them dispatching works for a short while and then it all locks up again [15:09:30] just commented @ https://phabricator.wikimedia.org/T200420#4453529 which is as far as I have got so far [15:12:00] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [15:15:20] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/37570dcb-90e6-11e8-b06e-879dbc791e34 [15:16:30] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/0bae24f4-90e6-11e8-8408-f8688bd3df1f [15:17:31] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [15:17:40] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [15:18:05] (03PS1) 10Aaron Schulz: Set logging for "LockManager" (e.g. that of LockManagerGroup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448055 [15:18:16] (03CR) 10Reedy: "extension-list being separate isn't going to help anyone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447946 (https://phabricator.wikimedia.org/T125618) (owner: 10Reedy) [15:18:29] addshore: ^ [15:18:35] (03CR) 10Addshore: [C: 032] Set logging for "LockManager" (e.g. that of LockManagerGroup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448055 (owner: 10Aaron Schulz) [15:19:45] aaah, the logger is always LockManager [15:19:53] Didn't spot that! :) [15:20:02] (03Merged) 10jenkins-bot: Set logging for "LockManager" (e.g. 
that of LockManagerGroup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448055 (owner: 10Aaron Schulz) [15:21:32] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add LockManager logging T200420 (duration: 00m 55s) [15:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:36] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [15:21:46] (03PS1) 10Addshore: Revert "Add wikibaseDispatchRedisLockManager to wmgMonologChannels" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448056 [15:21:52] (03CR) 10Addshore: [C: 032] Revert "Add wikibaseDispatchRedisLockManager to wmgMonologChannels" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448056 (owner: 10Addshore) [15:22:01] (03PS1) 10Addshore: Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 [15:22:07] (03PS2) 10Addshore: Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 [15:22:10] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/37570dcb-90e6-11e8-b06e-879dbc791e34 [15:22:11] (03CR) 10Addshore: [C: 032] Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 (owner: 10Addshore) [15:22:11] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/0bae24f4-90e6-11e8-8408-f8688bd3df1f [15:22:20] * addshore reverts the other logging [15:23:00] (03PS1) 10Aaron Schulz: Remove "wikibaseDispatchRedisLockManager" logger (not 
used by MediaWiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448058 [15:23:19] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 (owner: 10Addshore) [15:23:21] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [15:23:41] (03PS2) 10Addshore: Revert "Add wikibaseDispatchRedisLockManager to wmgMonologChannels" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448056 [15:23:50] (03CR) 10Addshore: [C: 032] Revert "Add wikibaseDispatchRedisLockManager to wmgMonologChannels" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448056 (owner: 10Addshore) [15:23:58] (03PS3) 10Addshore: Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 [15:24:02] (03CR) 10Addshore: [C: 032] Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 (owner: 10Addshore) [15:24:53] (03Abandoned) 10Aaron Schulz: Remove "wikibaseDispatchRedisLockManager" logger (not used by MediaWiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448058 (owner: 10Aaron Schulz) [15:25:14] (03Merged) 10jenkins-bot: Revert "Add wikibaseDispatchRedisLockManager to wmgMonologChannels" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448056 (owner: 10Addshore) [15:25:40] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [15:25:40] (03Merged) 10jenkins-bot: Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 (owner: 10Addshore) [15:27:01] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: 
W/830543386/0861b559-90e7-11e8-909c-3956ea1bbb30 [15:27:20] !log addshore@deploy1001 Synchronized wmf-config: Remove unused wikibaseDispatchRedisLockManager logging T200420 (duration: 00m 56s) [15:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:24] T200420: Wikidata dispatching stuck - https://phabricator.wikimedia.org/T200420 [15:28:10] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [15:28:55] * addshore sees no LockManager logs :( [15:30:08] any idea how is it possible that my system detects a virus in C:\Users\Yann\AppData\Roaming\HexChat\scrollback\freenode\#wikimedia-operations.txt ? [15:30:24] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5c9bf79]: Expose the mobile-html end point - T199491 (duration: 37m 37s) [15:30:30] 10Operations, 10Analytics, 10Services, 10Discovery-Search (Current work), 10Patch-For-Review: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10fdans) [15:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:32] T199491: Expose /page/mobile-html endpoint via RESTBase - https://phabricator.wikimedia.org/T199491 [15:30:39] could i get it from here? [15:30:44] 10Operations, 10Analytics, 10Services, 10Discovery-Search (Current work), 10Patch-For-Review: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10fdans) p:05Triage>03Normal [15:31:22] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5c9bf79]: Expose the mobile-html end point - T199491 [15:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:39] addshore: does lock() work for random keys (e.g. wfRandomString())? [15:32:33] twice in 2 days [15:33:00] it uses the same servers as the one used with swift, that works, and the logs are quite, so maybe something gets the locks and fails in wb? 
[15:33:01] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/729bbf80-90e5-11e8-b22b-0410e7d755b6 [15:33:16] yannf: yes, because people have been spamming in this channel, so your antivirus is probably detecting the spam as problematic [15:33:25] https://www.irccloud.com/pastebin/hqyWYSzw/ [15:33:28] AaronSchulz: ^^ works [15:33:50] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page returned the unexpected status 404 (expecting: 200) [15:33:51] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/0bae24f4-90e6-11e8-8408-f8688bd3df1f [15:34:05] (03PS1) 10Bstorm: dumps distribution: set labstore1007 as the VPS NFS host [puppet] - 10https://gerrit.wikimedia.org/r/448059 (https://phabricator.wikimedia.org/T196651) [15:34:17] legoktm, ah ok, if it is only spam, fine [15:34:22] thanks [15:35:00] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [15:35:01] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/37570dcb-90e6-11e8-b06e-879dbc791e34: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated [15:35:01] ril 29, 2016 responds with unexpected value at path = Missing keys: [uimage] 
[15:35:10] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/0861b559-90e7-11e8-909c-3956ea1bbb30 [15:35:30] (03CR) 10Bstorm: [C: 032] dumps distribution: set labstore1007 as the VPS NFS host [puppet] - 10https://gerrit.wikimedia.org/r/448059 (https://phabricator.wikimedia.org/T196651) (owner: 10Bstorm) [15:35:45] (03PS2) 10Volans: Add common base utility modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) [15:36:10] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/37570dcb-90e6-11e8-b06e-879dbc791e34: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated [15:36:10] ril 29, 2016 responds with unexpected value at path = Missing keys: [uimage] [15:36:11] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/0bae24f4-90e6-11e8-8408-f8688bd3df1f: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated [15:36:11] ril 29, 2016 responds with unexpected value at path = Missing keys: [uimage] [15:36:17] mobrovac: ---^ [15:36:20] mobrovac: is the mobile-html end point failing expected ? 
[15:36:27] I see you just deployed a new version [15:37:02] ah maybe the new endpoint is raising some garbage alarms [15:37:04] good point :) [15:42:47] AaronSchulz: can our redis lock managing cluster get split brained some how and not alert us to the fact? [15:45:40] yup akosiaris, all good [15:45:58] well, not good, but "expected" [15:46:17] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5c9bf79]: Expose the mobile-html end point - T199491 (duration: 14m 56s) [15:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:21] T199491: Expose /page/mobile-html endpoint via RESTBase - https://phabricator.wikimedia.org/T199491 [15:46:33] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5c9bf79]: Expose the mobile-html end point - T199491 [15:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:36] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5c9bf79]: Expose the mobile-html end point - T199491 (duration: 01m 04s) [15:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:17] addshore: I don't think so. There always has to be an overlapping node for ok StatusValue's. [15:57:45] (03CR) 10jenkins-bot: Set logging for "LockManager" (e.g. that of LockManagerGroup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448055 (owner: 10Aaron Schulz) [15:57:48] (03CR) 10jenkins-bot: Revert "Add wikibaseDispatchRedisLockManager to wmgMonologChannels" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448056 (owner: 10Addshore) [15:57:51] (03CR) 10jenkins-bot: Revert "Add a logger to wikibaseDispatchRedisLockManager" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448057 (owner: 10Addshore) [15:58:19] addshore: from the task, why did you start with unlock() ? 
[15:58:39] because the dispatch script said that the lock already exists [15:58:48] you can't unlock via an LM instance different than the one that locked it (since the request-specific session ID won't match) [15:59:29] aaah okay! that explains why the unlock isn't working, but the lock call still confirms that the lock is in place [15:59:44] it's not lock the BagOStuff ones, which just assume that anything calling unlock() is the current lock owner [15:59:48] *not like [15:59:59] (03PS1) 10Cmjohnson: Adding mgmt/production dns cloudelastic1001-4 [dns] - 10https://gerrit.wikimedia.org/r/448062 (https://phabricator.wikimedia.org/T194186) [16:00:04] godog, moritzm, and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1600). [16:00:04] Urbanecm: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt/production dns cloudelastic1001-4 [dns] - 10https://gerrit.wikimedia.org/r/448062 (https://phabricator.wikimedia.org/T194186) (owner: 10Cmjohnson) [16:00:16] I'm here, in case I'm needed [16:00:21] so, I did some more poking around with eval and it looks like the locks exist on the first 2 servers being used but not the thrid one [16:00:46] addshore: it short-circuits at getting a majority (2/3) [16:01:05] if the one of the first 2 failed, then it would use the third [16:01:41] Urbanecm: ack, looking [16:01:42] okay, so is a situation where a lock is on rdb1 and rdb2 but not rdb3 fine? 
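(Editorial sketch, not part of the channel log.) The 16:00-16:01 exchange describes a lock that is considered held once a majority of servers (2 of 3) accept it, with the third server contacted only if one of the first two fails. The toy model below illustrates that idea in Python. It is not MediaWiki's LockManager code: the class name, methods, and in-memory "servers" are hypothetical, and only the host names rdb1, rdb2, rdb3 come from the log.

```python
from typing import Dict, List, Set


class ToyQuorumLock:
    """Hypothetical in-memory model of majority (quorum) locking."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = servers
        # Keys currently held on each simulated server.
        self.held: Dict[str, Set[str]] = {s: set() for s in servers}
        # Servers simulated as unreachable.
        self.down: Set[str] = set()

    def _try_server(self, server: str, key: str) -> bool:
        # A server refuses if it is down or already holds the key.
        if server in self.down or key in self.held[server]:
            return False
        self.held[server].add(key)
        return True

    def lock(self, key: str) -> bool:
        majority = len(self.servers) // 2 + 1
        acquired: List[str] = []
        for server in self.servers:
            if self._try_server(server, key):
                acquired.append(server)
            if len(acquired) >= majority:
                # Short-circuit: remaining servers are never contacted.
                return True
        # No majority: roll back whatever was taken.
        for server in acquired:
            self.held[server].discard(key)
        return False


lm = ToyQuorumLock(["rdb1", "rdb2", "rdb3"])
first = lm.lock("dispatch-key")    # taken on rdb1 and rdb2 only
second = lm.lock("dispatch-key")   # only rdb3 is free, so no majority
```

In this model a key present on rdb1 and rdb2 but absent from rdb3 is the normal outcome of a successful lock, and a second caller cannot reach a majority because any two majorities share at least one server.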
[16:03:00] yes [16:03:06] (03PS6) 10Filippo Giunchedi: Support zh-my localized URLs in Wikipedia virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/443215 (https://phabricator.wikimedia.org/T198371) (owner: 10Urbanecm) [16:04:12] (03PS2) 10Cmjohnson: Adding mgmt/production dns cloudelastic1001-4 [dns] - 10https://gerrit.wikimedia.org/r/448062 (https://phabricator.wikimedia.org/T194186) [16:05:18] (03PS1) 10Thcipriani: Gerrit 2.15.3 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/448063 (https://phabricator.wikimedia.org/T199460) [16:06:10] (03CR) 10Filippo Giunchedi: [C: 032] Support zh-my localized URLs in Wikipedia virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/443215 (https://phabricator.wikimedia.org/T198371) (owner: 10Urbanecm) [16:07:35] (03CR) 10RobH: [C: 031] Adding mgmt/production dns cloudelastic1001-4 [dns] - 10https://gerrit.wikimedia.org/r/448062 (https://phabricator.wikimedia.org/T194186) (owner: 10Cmjohnson) [16:07:37] (03CR) 10Gehel: Add common base utility modules (0315 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:08:03] Urbanecm: {{done}} should be fully rolled out in 1h [16:08:13] Good. Anything else I should do? [16:08:51] !log rolling restart of elasticsearch / cirrus / eqiad completed - T156137 [16:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:55] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [16:10:13] Urbanecm: no I don't think so [16:10:20] Ok. [16:11:00] godog, will you merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/446786/ as well? 
[16:11:38] Urbanecm: there was a comment on that patch, cc Reedy [16:11:42] (03CR) 10Andrew Bogott: [C: 031] Adding mgmt/production dns cloudelastic1001-4 [dns] - 10https://gerrit.wikimedia.org/r/448062 (https://phabricator.wikimedia.org/T194186) (owner: 10Cmjohnson) [16:12:13] Oh, missed it, thanks [16:13:53] Urbanecm: the comment is mine :) IIUC the mediawiki_test puppet environment is meant to safely merge changes that applies only to mwdebug* [16:14:03] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Datasets-General-or-Unknown, 10Patch-For-Review: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10Bstorm) I think all tasks on this are done at this point @Cmjohnson -- the servers are both done, in serv... [16:14:05] Do we really need to split it? :) [16:14:24] I know, I meant I didn't notice it, otherwise I'd understand why it wasn't merged by godog :) [16:14:27] that's the idea yes :D [16:14:43] (03CR) 10Gehel: Add common base utility modules (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:14:59] Urbanecm: nono sorry I just wanted to explain my comment, that's it [16:15:18] Oh, okay then. Thank you elukey [16:17:10] (03CR) 10Cmjohnson: [C: 032] Adding mgmt/production dns cloudelastic1001-4 [dns] - 10https://gerrit.wikimedia.org/r/448062 (https://phabricator.wikimedia.org/T194186) (owner: 10Cmjohnson) [16:19:39] AaronSchulz: can there be some kind of race acquiring the locks? 
:/ [16:20:53] (03PS5) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) [16:20:55] (03PS1) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/448064 (https://phabricator.wikimedia.org/T199935) [16:21:45] 10Operations, 10Traffic: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ema) [16:21:58] 10Operations, 10Traffic: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ema) p:05Triage>03Normal [16:24:11] aaron, I'm lost at things to look at, im going to go and grab some food [16:24:31] zeljkof: that ticket probably shouldnt block the train now, and it's probably find to move wikidatawiki to .14 too [16:25:05] addshore: ok, thanks for letting me know [16:27:02] godog: ^ They're split [16:27:44] heh [16:28:27] (03CR) 10Dzahn: [C: 032] grafana: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447938 (owner: 10Dzahn) [16:28:48] (03PS3) 10Dzahn: grafana: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447938 [16:29:02] Reedy: I am checking a thing with the puppet compiler, will take care of merging in a sec [16:30:22] 10Operations, 10Traffic: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ema) [16:31:01] (03PS2) 10Dzahn: deployment_server: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447939 [16:32:57] Reedy: thanks for splitting the patches! elukey will merge them [16:36:53] Reedy: please have a bit of patience with me [16:37:01] so where does "sites-available/wikimaniateam.wikimedia.org.conf" come from? 
[16:37:03] (03CR) 10Paladox: "Needs to be uploaded to archiva" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/448063 (https://phabricator.wikimedia.org/T199460) (owner: 10Thcipriani) [16:37:08] I don't find it in the apache config [16:37:09] 10Operations, 10Wikimedia-Apache-configuration, 10Chinese-Sites, 10Patch-For-Review, 10User-Urbanecm: All "zh-my" variant page views get 404 Not Found on zh.wikipedia.org - https://phabricator.wikimedia.org/T198371 (10RazeSoldier) I am very happy to see that this problem has been solved. [16:38:06] elukey: https://github.com/wikimedia/puppet/blob/production/environments/mediawiki_test/modules/mediawiki_exp/templates/apache/sites/included/wikimaniateam.wikimedia.org.conf.erb [16:39:39] does it need to be removed if we include it directly? [16:40:04] We're not touching it? [16:40:18] Oh [16:40:20] ffs [16:40:30] It's basically a merge conflict [16:40:54] (03PS6) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) [16:41:09] (03PS7) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) [16:41:19] (03PS2) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/448064 (https://phabricator.wikimedia.org/T199935) [16:41:22] ah okok sorry :( [16:41:25] looks better now [16:41:30] It's probably my fault [16:41:34] I thought I fixed that once :) [16:41:49] other thing (please don't kill me for the nitpick) - rewrite wikimania2019.wikimedia.org //wikimania.wikimedia.org [16:42:03] there is not tab in there, but a space.. so I have no idea about that .dat file [16:42:11] does it need to be a tab or a space is fine? 
[16:42:42] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (10ayounsi) [16:42:45] I think syntactically, a space is fine [16:42:50] But for consistency, should be a tab [16:43:21] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (10ayounsi) [16:43:59] (03PS8) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) [16:45:00] now there is a tab in redirects.conf [16:45:43] (03PS3) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/448064 (https://phabricator.wikimedia.org/T199935) [16:46:31] PROBLEM - High load average on labstore1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [16:46:31] (03PS4) 10Elukey: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/448064 (https://phabricator.wikimedia.org/T199935) (owner: 10Reedy) [16:47:28] not the right one to rebase.. 
Reedy - let's remove the extra tab in the testing one and we are good to go [16:47:35] I have already disabled puppet across the fleet [16:47:52] jouncebot: now [16:47:52] For the next 0 hour(s) and 12 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1600) [16:48:41] hello mutante :) [16:49:23] (03PS9) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) [16:50:13] (03PS10) 10Elukey: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) (owner: 10Reedy) [16:50:54] (03CR) 10Elukey: [C: 032] Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) (owner: 10Reedy) [16:52:45] 10Operations, 10Traffic: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ema) [16:52:51] Reedy: live on mwdebug1002 [16:53:48] I get "no wiki found" [16:53:50] Which is expected [16:53:59] And no redirect. 
so LGTM [16:54:01] yep I am checking as well [16:54:07] everything looks good [16:55:14] from apache-test [16:55:16] https://wikimania.wikimedia.org * 404 Not Found [16:55:16] https://wikimania2015.wikimedia.org * 301 Moved Permanently https://wikimania2015.wikimedia.org/wiki/Special:MyLanguage/Wikimania [16:55:19] https://wikimania2018.wikimedia.org * 301 Moved Permanently https://wikimania2018.wikimedia.org/wiki/Special:MyLanguage/Wikimania [16:55:22] https://wikimania2019.wikimedia.org * 301 Moved Permanently https://wikimania.wikimedia.org/ [16:56:00] good [16:56:05] let's do the other one then [16:56:23] (03PS5) 10Elukey: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/448064 (https://phabricator.wikimedia.org/T199935) (owner: 10Reedy) [16:56:55] (I also verified that running puppet on mw1266 didn't change anything as expected, so the mediawiki_exp puppet env works fine) [16:57:20] (03CR) 10Elukey: [C: 032] Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/448064 (https://phabricator.wikimedia.org/T199935) (owner: 10Reedy) [16:58:19] testing on mw1266 (app server) [16:59:58] (03PS1) 10BBlack: p::cache::base: fix storage_parts typo [puppet] - 10https://gerrit.wikimedia.org/r/448075 [17:00:00] (03PS1) 10BBlack: storage config tweaks for cp1075-99 [puppet] - 10https://gerrit.wikimedia.org/r/448076 (https://phabricator.wikimedia.org/T195923) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1700). 
[17:00:49] all good as well, nothing exploding in the error log [17:01:35] !log moving row C vrrp master to cr1 - T187962 [17:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:40] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [17:01:50] PROBLEM - Memory correctable errors -EDAC- on wtp2011 is CRITICAL: 6 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [17:02:17] Reedy: going to roll it out on mw2* hosts first, check error logs just in case, and then do the same for eqiad [17:02:22] cheers [17:02:23] will ping you when done [17:02:40] thanks for the patience :) [17:03:51] RECOVERY - High load average on labstore1005 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [17:04:21] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/0bae24f4-90e6-11e8-8408-f8688bd3df1f [17:07:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:08:50] AaronSchulz: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/Wikibase.php#L131-L134 [17:08:55] max time is set :) [17:09:20] and for max-passes the doc is "Default: 1 if --max-time is not set, infinite if it is." 
[17:09:22] ficing the restbase alerts ^ [17:09:27] s/ficing/fixing/ [17:10:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:11:10] seems api related --^ [17:11:21] but also a temp spike [17:11:30] I am currently running the new apache config only in codfw [17:12:56] two temp spikes sorry [17:14:12] PROBLEM - High load average on labstore1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [17:20:53] addshore: I was reading that as 1 : PHP_INT_MAX, probably since "infinite time => just one pass" seemed odd [17:22:09] running puppet on the eqiad apaches [17:25:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:26:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:27:19] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5ce3a68]: Fix the failing mobile-html test - T199491 [17:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:23] T199491: Expose /page/mobile-html endpoint via RESTBase - https://phabricator.wikimedia.org/T199491 [17:27:24] AaronSchulz: it all still seems super odd to me [17:27:48] none of the code changed then it started happening, which screams something odd is happening with redis to me [17:31:13] addshore: btw, when is 
chd_lock not NULL? [17:31:26] hmm, I think that should be always now [17:31:41] chd_lock is just left over from when sql was used for locking. *checks in the db* [17:32:31] << $this->releaseClientLock( $db, $state['chd_lock'] ); => $this->lockManager->unlock( [ $lock ] )->isOK(); >> seems odd to me [17:32:48] where is that? [17:33:05] the parent class, SqlChangeDispatchCoordinator [17:33:39] nothing overrides that code afaik (releaseClient() ) [17:33:55] Reedy: change rolled out [17:33:55] phpstorm just violently exploded, let me reload it quickly [17:34:11] heh, I'm using phpstorm [17:38:11] releaseClientLock is overridden in the LockmanagerCoordinator though [17:38:40] but it still uses $lock [17:39:30] $lock is just the name of the lock? [17:39:31] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [17:39:58] oh, hmmm [17:41:35] AaronSchulz: chd_lock: the name of a global lock currently active for that client wiki [17:41:40] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [17:42:30] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [17:42:31] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [17:42:39] chd_lock is set in lockClient [17:43:39] (03PS3) 10Gehel: Enable constraints fetching for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/447740 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:43:51] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [17:44:25] (03CR) 10Gehel: [C: 032] Enable constraints fetching for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/447740 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:45:08] my hunch is still it has something to do with the rapid acquisition and revoking of locks, but, that even seems far fetched.... 
can't pin it down on anything [17:45:11] * addshore goes to make food [17:45:57] !log created new archiva user tyler for Tyler Cipriani and gave him role of global repo manager (for gerrit uploads instead of using archiva-deploy user which can cause password sync issues, so personalized users ) [17:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:01] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/37570dcb-90e6-11e8-b06e-879dbc791e34 [17:46:19] addshore: in the lockClient() path, what about unlock()? [17:46:19] jouncebot: now [17:46:19] For the next 0 hour(s) and 13 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1700) [17:46:31] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title}{/revision} (Get mobile-html of a test page) is CRITICAL: Test Get mobile-html of a test page had an unexpected value for header etag: W/830543386/729bbf80-90e5-11e8-b22b-0410e7d755b6 [17:46:54] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5ce3a68]: Fix the failing mobile-html test - T199491 (duration: 19m 36s) [17:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:58] T199491: Expose /page/mobile-html endpoint via RESTBase - https://phabricator.wikimedia.org/T199491 [17:47:09] mobrovac: should I wait a bit to deploy? 
[17:47:39] arlolra: no, you can go ahead, i'm deploying unrelated things (some monitoring check fixes) [17:47:40] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [17:47:52] ok, thanks [17:47:54] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5ce3a68]: Fix the failing mobile-html test, take #2 - T199491 [17:47:55] AaronSchulz: the releaseClientLock path? chd_site is set to null in releaseClient line 504 just after releaseClientLock is called which does the redis lock removal using the name [17:47:56] !log arlolra@deploy1001 Started deploy [parsoid/deploy@1dce68c]: Updating Parsoid to cdf8ace [17:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:21] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [17:50:21] addshore: and the name is $state['chd_lock']. I'm trying to see where that is set (other than from the NULL DB value). [17:51:05] AaronSchulz: i had that a second ago, let me look again [17:51:06] the << $state['chd_lock'] = null; >> seems irrelevant [17:52:11] PROBLEM - High load average on labstore1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [17:53:00] AaronSchulz: maybe I was dreaming seeing that set... [17:53:15] but counter point, if that is not set, how has this been working up until now? ;) [17:53:17] I'm sure it used to be before some patches a while back [17:53:25] so unlock( null ) would explain this right? [17:53:33] everything would get stuck until expiration [17:53:39] unlock null would explain this yes [17:53:54] and why there are no LM nor filebackend errors [17:54:04] so, apparently that happened at 11am today? 
which moved wikidatawiki from .13 to .14 [17:54:10] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5ce3a68]: Fix the failing mobile-html test, take #2 - T199491 (duration: 06m 16s) [17:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:14] T199491: Expose /page/mobile-html endpoint via RESTBase - https://phabricator.wikimedia.org/T199491 [17:54:15] I made patch to throw for null LM paths [17:56:41] (03CR) 10Dzahn: [C: 032] osm: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447941 (owner: 10Dzahn) [17:56:42] RECOVERY - High load average on labstore1005 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [17:58:47] !log moving row C vrrp master back to cr2 - T187962 [17:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:50] T187962: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 [17:59:05] AaronSchulz: oaky, im going to look a bit more to see if I can find where this thing went [17:59:18] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@1dce68c]: Updating Parsoid to cdf8ace (duration: 11m 22s) [17:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1800). [18:00:04] tgr: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:42] o/ [18:01:45] * addshore eats dinner [18:02:06] (03PS1) 10Dzahn: backup: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448083 [18:02:42] tgr: I would like to move the train forward [18:03:09] not sure how to do it, before or after swat, probably better after? cc greg-g? [18:03:48] (03PS1) 10Dzahn: prometheus: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448084 [18:04:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:04:20] next window is gerrit upgrade, does not sound like a good time for either SWAT or train [18:04:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:04:50] tgr: we are behind on train, greg-g said he would like to get it back on schedule [18:05:03] (03PS1) 10Dzahn: yubiauth: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448086 [18:05:14] zeljkof: do you expect it to take long? I can do the SWAT at 21 UTC, that's the next free window [18:05:14] jouncebot: next [18:05:14] In 0 hour(s) and 54 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [18:05:14] In 0 hour(s) and 54 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [18:05:36] hmm. those at the same time? 
[18:05:47] tgr: no, if all goes well I'm done in a few minutes, in case of trouble a few more minutes to revert [18:06:13] mutante: it's the week of EU train so nothing happens during US train [18:06:16] AaronSchulz: I can't even find where that is set back in REL1_30 ... [18:06:19] even better then, please go ahead [18:06:22] how has this been working? [18:06:24] zeljkof: oh, i see. thx [18:06:38] tgr: ok, moving the train forward [18:08:11] PROBLEM - High load average on labstore1005 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [18:08:30] (03PS1) 10Dzahn: swift: use :;profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448087 [18:08:31] !log Updated Parsoid to cdf8ace (T194806, T110004, T156099, T188478, T194083) [18:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:40] T188478: Parsoid needs a native implementation of - https://phabricator.wikimedia.org/T188478 [18:08:40] T194806: P-wrapping doesn't match php parser's $inBlockElem - https://phabricator.wikimedia.org/T194806 [18:08:41] T110004: DOM Pass for wrapping bare text found in and other "block" (in html4-parlance) nodes like
, , . - https://phabricator.wikimedia.org/T110004 [18:08:41] T194083: Found nested inserted dom-diff flags! - https://phabricator.wikimedia.org/T194083 [18:08:41] T156099: Be explicit about what ext api exports - https://phabricator.wikimedia.org/T156099 [18:10:36] (03PS1) 10Zfilipin: all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448089 [18:10:38] (03CR) 10Zfilipin: [C: 032] all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448089 (owner: 10Zfilipin) [18:10:54] (03PS1) 10Dzahn: maps/vectortiles_master: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448090 [18:11:58] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448089 (owner: 10Zfilipin) [18:12:16] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448089 (owner: 10Zfilipin) [18:12:34] (03PS1) 10Dzahn: discovery: use ::profile::base::firewall (comments only) [puppet] - 10https://gerrit.wikimedia.org/r/448092 [18:12:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:13:05] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.14 [18:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:13:11] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T199636 (10Marostegui) The disk has been swapped by 
Chris ``` PD: 1 Information Enclosure Device ID: 32 Slot Number: 1 Drive's position: DiskGroup: 0, Span: 0, Arm: 1 Enclosure position: 1 Device Id: 1 WWN: 5000C50... [18:14:37] (03CR) 10Dzahn: [C: 032] "changes to commented lines only" [puppet] - 10https://gerrit.wikimedia.org/r/448092 (owner: 10Dzahn) [18:15:57] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (10ayounsi) [18:16:08] greg-g: rolling back .14 fatalmonitor exploded :/ [18:16:19] oooh [18:16:29] (&*$*&(@$*! [18:16:34] MAPCacheLRU [18:16:35] (03CR) 10Dzahn: [C: 032] "fixes a TODO and removes a lint:ignore" [puppet] - 10https://gerrit.wikimedia.org/r/448090 (owner: 10Dzahn) [18:16:52] (03PS2) 10Dzahn: maps/vectortiles_master: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448090 [18:17:25] [{exception_id}] {exception_url} UnexpectedValueException from line 144 of /srv/mediawiki/php-1.32.0-wmf.14/includes/libs/MapCacheLRU.php: MapCacheLRU::has called with invalid key. Must be string or integer. 
[18:17:28] !log set merge.policy.reclaim_deletes_weight=12.0 for wikidatawiki_content in eqiad to attempt to reduce the deleted documents count [18:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:19:01] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: Revert "all wikis to 1.32.0-wmf.14" [18:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:20] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:19:24] greg-g: rolled back, will add more data to T200456 [18:19:24] T200456: MapCacheLRU::has called with invalid key. Must be string or integer. 
- https://phabricator.wikimedia.org/T200456 [18:19:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:19:34] zeljkof: thanks, and if you have time, email update :/ [18:19:43] greg-g: sure [18:19:56] then run away :) [18:20:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:21:32] (03PS1) 10Zfilipin: Revert "all wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448095 (https://phabricator.wikimedia.org/T200456) [18:21:49] (03CR) 10Zfilipin: [C: 032] Revert "all wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448095 (https://phabricator.wikimedia.org/T200456) (owner: 10Zfilipin) [18:22:41] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:23:10] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [18:23:40] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [18:24:07] (03Merged) 10jenkins-bot: Revert "all wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448095 (https://phabricator.wikimedia.org/T200456) (owner: 10Zfilipin) [18:24:11] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:24:20] greg-g: will do :D [18:24:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:25:21] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:27:58] (03CR) 10jenkins-bot: Revert "all wikis to 1.32.0-wmf.14" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448095 (https://phabricator.wikimedia.org/T200456) (owner: 10Zfilipin) [18:29:06] zeljkof: wanna try a quick fix? 
[18:29:23] tgr: sorry, it's really late here, I have to go [18:29:41] I'm done for today, you can do SWAT if you want to [18:29:46] (as far as I am gone) [18:29:58] (as far as I am concerned, that is) [18:30:00] * zeljkof is gone [18:32:08] (03PS2) 10Dzahn: yubiauth: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448086 [18:33:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:33:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:33:51] (03CR) 10Dzahn: [C: 032] yubiauth: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448086 (owner: 10Dzahn) [18:35:11] AaronSchulz: I hacked in a tiny log and can confirm that it isn't passing the lock name when releasing the lock [18:35:38] AaronSchulz: how the .... 
O_o did this trigger / start happening at 11am and not before, im very confused about that still [18:47:54] greg-g: I have a patch up if someone wants to retry the train [18:49:47] (03CR) 10Thcipriani: "> Needs to be uploaded to archiva" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/448063 (https://phabricator.wikimedia.org/T199460) (owner: 10Thcipriani) [18:51:19] in any case, no time left for SWAT, I'll do it after the gerrit window [18:53:32] AaronSchulz: see https://gerrit.wikimedia.org/r/448103, but I'm still reluctant to stop looking at this as I have no idea how this started happening [18:57:03] here for gerrit support if needed [18:57:08] jouncebot: next [18:57:08] In 0 hour(s) and 2 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [18:57:08] In 0 hour(s) and 2 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [18:57:50] also there are some gerrit related config changes that we could potentially merge [19:00:03] mutante: thanks, twentyafterfour I suppose we can go ahead and get started [19:00:04] mutante, twentyafterfour, and thcipriani: Your horoscope predicts another unfortunate Gerrit upgrade deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900). [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [19:00:30] thanks jouncebot.. 
[19:01:09] mutante: this would be a good config change to merge: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441397/ [19:01:48] (03CR) 10Thcipriani: [V: 032 C: 032] Merge tag 'v2.15.3' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/447846 (owner: 10Thcipriani) [19:02:31] thcipriani: i wanted to say that :) [19:02:32] (03CR) 1020after4: [C: 031] Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - 10https://gerrit.wikimedia.org/r/441397 (owner: 10Paladox) [19:02:45] the only reason i didnt merge yesterday was because users hate to be logged out [19:02:54] (03PS6) 10Dzahn: Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - 10https://gerrit.wikimedia.org/r/441397 (owner: 10Paladox) [19:02:56] (03CR) 10Thcipriani: [V: 032 C: 032] Gerrit 2.15.3 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/448063 (https://phabricator.wikimedia.org/T199460) (owner: 10Thcipriani) [19:03:27] (03CR) 1020after4: [C: 031] Merge tag 'v2.15.3' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/447846 (owner: 10Thcipriani) [19:03:30] * addshore is so very done today [19:03:50] twentyafterfour: do you see a submit button on ^ ? [19:04:22] because I do not [19:04:49] (03CR) 10Dzahn: [C: 032] Gerrit: Increase changeid_project and ldap_usernames caches [puppet] - 10https://gerrit.wikimedia.org/r/441397 (owner: 10Paladox) [19:04:55] thcipriani: nope [19:05:00] oh good [19:06:17] twentyafterfour: I guess I could just push them? 
[19:06:26] thcipriani: perhaps [19:06:31] this also has a +1 from chad and is for performance: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441391/ [19:07:27] https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/438127 looks like chad uploaded it with CR+2/V+2 and submit [19:07:42] RECOVERY - High load average on labstore1005 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [19:08:58] afaict that's normal on operations/software that you manually V+2 and submit [19:09:22] operations/software/gerrit maybe not all others [19:09:57] there isn't a submit button even after manually v+2ing [19:10:18] i see one [19:10:24] I guess that means I don't have submit rights [19:10:25] Is that that stupid gerrit bug that needs an admin? [19:10:49] I think that I am a gerrit admin, no? [19:11:13] want me to click the submit button? i didn't think i had more permissions than you though [19:11:37] that'll be up to thcipriani ;) I guess submit it? [19:11:41] yes please. I'll check out permissions after the upgrade, I guess [19:12:17] merged [19:12:51] thanks [19:13:49] you are listed in gerrit admins directly. i only inherit it from "ops" group [19:14:06] all the permissions for that repo are "admins" which includes thcipriani and myself. [19:14:14] both 20after4 and Thcipriani, ack [19:14:43] mutante: do you happen to see a submit button on https://gerrit.wikimedia.org/r/#/c/operations/software/gerrit/+/448063/ as well? That's the one I need to deploy. [19:14:58] I guess it's weirdness with the inheritance hierarchy? [19:15:02] yes, i do [19:15:14] could you submit that one please? [19:15:20] merged that too [19:15:32] the other one wasn't wrong, right? 
also had your +1s [19:15:57] the other one needed to be merged as well, but it's the source from which we built the deploy branch [19:16:04] ok,ack [19:16:43] alright, so I think I'm going to deploy to gerrit2001 first just to make sure there aren't surprises with scap or git-fat [19:16:48] Reedy bug? [19:16:57] that's only for wips [19:17:03] Gerrit is full of bugs [19:17:11] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:17:22] submit should be available on ops/software/gerrit now [19:17:31] I explicitly added it for admins [19:18:05] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T199636 (10Marostegui) 05Open>03Resolved ``` 19:17 < icinga-wm> RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy... [19:18:39] first gerrit2001 , +1 [19:18:39] hmm, when i git fat pull (sizes for jars are 74) [19:19:06] 74 would indicate thta git fat pull iddn't work [19:19:14] this is on https://gerrit.wikimedia.org/r/448063 [19:19:16] that's the magic size of a git-fat file [19:19:28] thcipriani: I'd like to backport https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/448107/; let me know when gerrit is done please. [19:19:36] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (10Marostegui) @jcrespo how do you feel about closing this task as resolved? [19:19:38] git fat init; git fat pull should be all you need [19:19:40] AaronSchulz: will do [19:19:56] no way I'd deploy during ;) [19:20:05] good call :) [19:21:11] ah [19:21:12] works now [19:21:28] and jar size has decreased, due to elasticsearch changes and removing unused deps. 
[19:22:02] yep, I just verified locally as well [19:22:09] ok, going ahead with gerrit2001 [19:22:25] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@b69b468]: deploy gerrit 2.15.3 to cold-replica [19:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:34] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@b69b468]: deploy gerrit 2.15.3 to cold-replica (duration: 00m 10s) [19:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:49] note that with gerrit2001, you will only be able to verify the size of files, it won't start [19:22:52] due to the db [19:23:00] it cannot access the db in codfw [19:24:31] sure, wanted to verify that scap worked as expected and that git-fat worked as expected in production and that seems to be the case [19:25:08] twentyafterfour: anything you want to check? ready to move forward to cobalt? [19:26:31] it's not mentioned in the docs, but are the icinga checks we need to disable? [19:27:02] thcipriani: good question [19:27:02] i'll do that [19:27:17] thank you [19:27:23] I can't think of anything else but icinga definitely... [19:27:53] there are: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cobalt&service=gerrit+process [19:28:07] and https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cobalt&service=HTTPS [19:28:35] thcipriani mutante twentyafterfour it works https://gerrit.git.wmflabs.org/r/ [19:28:38] (just upgraded) [19:28:42] scheduled downtime now [19:28:46] (2 hours?) 
[19:28:58] seems good [19:29:08] hopefully we won't need it [19:29:09] oh, we have more, the one for gerrit ssh [19:29:41] done [19:30:02] cool, I'll update the docs when we're done [19:30:12] ok, going live on cobalt [19:30:14] fyi: 3 checks, https on gerrit (http connection itself and also cert expiry and tls ), the gerrit process running and the gerrit-ssh port being open [19:31:07] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@b69b468]: upgrade to gerrit 2.15.3 [19:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:15] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@b69b468]: upgrade to gerrit 2.15.3 (duration: 00m 08s) [19:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:44] may we add that second config change? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441391/ [19:31:52] the one that Chad already +1ed [19:33:36] might as well I think [19:33:39] ^ [19:33:46] (03CR) 1020after4: [C: 031] Gerrit: Set cache for groups [puppet] - 10https://gerrit.wikimedia.org/r/441391 (owner: 10Paladox) [19:33:48] was just double checking units :) [19:33:48] (03PS7) 10Paladox: Add gerrit-theme.html and also add footer links [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) [19:34:11] needs manual rebase.. 
sec [19:34:34] k [19:36:04] !log mobrovac@deploy1001 Started deploy [eventstreams/deploy@b1f577d]: Provide better stream and error handling and stop using compression - T199813 [19:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:08] T199813: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 [19:36:11] (03PS3) 10Dzahn: Gerrit: Set cache for groups [puppet] - 10https://gerrit.wikimedia.org/r/441391 (owner: 10Paladox) [19:36:51] (03PS4) 10Dzahn: Gerrit: Set cache for groups [puppet] - 10https://gerrit.wikimedia.org/r/441391 (owner: 10Paladox) [19:36:52] arr, still wrong :p [19:37:30] (03CR) 10Dzahn: [C: 032] Gerrit: Set cache for groups [puppet] - 10https://gerrit.wikimedia.org/r/441391 (owner: 10Paladox) [19:38:08] thcipriani: merged on puppet master, if you want to run puppet on cobalt .. go ahead [19:38:17] !log mobrovac@deploy1001 Finished deploy [eventstreams/deploy@b1f577d]: Provide better stream and error handling and stop using compression - T199813 (duration: 02m 13s) [19:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:25] okie doke, I will do that [19:39:01] AaronSchulz: well i hacked in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/448103/1/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php for mwmaint1001 and also added logging to make sure the unlock is called correctly and all seems to be good [19:39:04] !log running puppet on cobalt for gerrit.config update [19:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:09] i dont have an opinion on "make polygerrit the default UI". 
;) [19:39:12] I'll sync the actual patch once the current slot is done [19:39:16] jouncebot: now [19:39:16] For the next 1 hour(s) and 20 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [19:39:17] For the next 1 hour(s) and 20 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [19:39:38] addshore: nice [19:40:02] mutante: twentyafterfour ok, time for service gerrit restart [19:40:10] AaronSchulz: still no idea how this happened this morning, still don't see the change between .13 and .14 that caused it, and not sure why the rollback wouldnt have fixed it [19:40:13] paladox: so got the 2 tuning patches in during upgrade.. also doing avatar stuff is asking too much at once i think. and not sure about theme :) [19:40:20] thcipriani: cool !:) [19:40:34] !log restarting gerrit [19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:11] yeh (some of the puppet changes can be done later) since some will not require restarting gerrit. [19:41:27] though gerrit now supports reloading some of gerrit.config without a restart. [19:41:38] paladox: ok, and .. nice! 
[19:42:53] well, we also still have the whole "Gerrit: Move all logging to /var/log" and log4j thing for some time [19:43:06] yeh, need to do a fix to gerrit for that :) [19:43:08] (03PS9) 10Thcipriani: Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [19:44:03] paladox: ok, let's clarify that on gerrit too [19:44:30] addshore: that sounds a little worrisome [19:44:53] ok so it seems like the link between gerrit and zuul is fine (antoine was worried about that) still watching logs [19:44:59] we may also want to do https://gerrit.wikimedia.org/r/c/operations/puppet/+/408298 [19:45:19] twentyafterfour: I'v been super confused all day :) [19:45:32] gerrit works for me and shows 2.15.3 :) [19:45:56] it feels faster due to the restart or else [19:46:01] (03PS4) 10Paladox: Gerrit: Set notedb configs to enable notedb [puppet] - 10https://gerrit.wikimedia.org/r/408298 (https://phabricator.wikimedia.org/T174034) [19:46:18] (03PS5) 10Paladox: Gerrit: Set notedb configs to enable notedb [puppet] - 10https://gerrit.wikimedia.org/r/408298 (https://phabricator.wikimedia.org/T174034) [19:46:30] twentyafterfour: im still looking around wondering how this ever working in the first place though [19:46:36] paladox: what about "Link to gerrit-theme.html in scap repo" [19:46:52] needs https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/439503 first [19:46:57] oh right [19:47:22] i am going to have to retest that [19:47:28] to confirm that will work [19:47:43] (03CR) 1020after4: "I thought notedb was already enabled by default" [puppet] - 10https://gerrit.wikimedia.org/r/408298 (https://phabricator.wikimedia.org/T174034) (owner: 10Paladox) [19:48:07] paladox: sometimes i want to set a gerrit change to "stalled" just like a ticket:) [19:48:16] twentyafterfour well, not for existing sites (which we are). 
It's currently storing the config in notedb.config (unpuppetized) [19:48:17] the opposite of WIP [19:48:22] heh [19:48:32] you could use the #hashtag #stalled [19:48:58] +1 for hashtag [19:49:08] that is already enabled :) [19:49:09] un-puppetized .. hmm [19:49:35] (03CR) 10Dzahn: "#stalled" [puppet] - 10https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [19:49:38] yeh, that file is not meant for puppet use because if we puppetized it, the gerrit upgrade would have totally failed. [19:50:31] i spoke with one of the google employees, who made notedb.config since if we puppetized it in gerrit.config when it was migrating from the db to notedb it would have broken [19:50:32] so you are saying now after the upgrade... [19:50:40] it could be puppetized..and that change does that? [19:51:00] we needed to wait until notedb.config was created by gerrit [19:51:09] to be able to puppetize it in gerrit.config [19:51:17] since that meant that it migrated everything [19:51:30] and that is the case now? [19:51:56] yep [19:51:56] after a handful of manual tests of things while watching logs, I'll call the upgrade done [19:52:04] :) [19:52:15] nice! [19:52:42] mutante: paladox thanks for your help! [19:52:47] you're welcome :) [19:52:47] twentyafterfour: nice work [19:52:52] thcipriani: welcome! [19:53:09] thcipriani: so it sounds like we should probably also do that notedb change paladox explained above [19:53:22] if that config file has been generated now [19:53:35] per commit message on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/408298/ [19:53:40] yep, once that is merged we remove notedb.config at the same time. (or cp it to a backup location) [19:54:08] woo [19:54:11] and groups is now fast! [19:54:13] how do we know it is in "dual mode"? 
[19:54:53] [noteDb "changes"] [19:54:53] autoMigrate = true [19:54:55] sets dual mode [19:55:02] but notedb.config overrides it [19:55:10] (which is good otherwise gerrit would be broken) [19:55:19] so this just moves what's in notedb.config to gerrit.config? [19:55:22] yep [19:55:38] the google employee was kind enough to do that for us :) (when i brought up puppet) [19:56:50] ok so the procedure would be to deploy the gerrit.config change, then mv notedb.config ~/notedb.config.bak, then...do we need to restart again? [19:57:07] yep [19:57:13] to confirm it doesn't break anything. [19:57:21] though it shouldn't. [19:57:40] yeah, seems like the same settings that are already enabled. [19:58:21] yep [19:58:28] !log mobrovac@deploy1001 Started deploy [proton/deploy@883cacd]: Use a more secure MW API template - T198461 [19:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:32] T198461: Proton cannot assume the requests are for {lang}.wikipedia.org - https://phabricator.wikimedia.org/T198461 [19:59:02] !log mobrovac@deploy1001 Finished deploy [proton/deploy@883cacd]: Use a more secure MW API template - T198461 (duration: 00m 33s) [19:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:15] so this: https://gerrit-review.googlesource.com/Documentation/note-db.html#_configuration [20:00:03] twentyafterfour: mutante do we want to do ^ now since we still have time in the window? [20:00:18] thcipriani: up to you [20:00:22] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10mobrovac) [20:01:19] thcipriani: yes, i think so [20:01:29] alright, let's do it [20:01:41] because it was tied to "once 2.15 writes the config" and unpuppetized. 
as far as i understand [20:01:44] ok [20:02:03] (03CR) 10Dzahn: [C: 032] Gerrit: Set notedb configs to enable notedb [puppet] - 10https://gerrit.wikimedia.org/r/408298 (https://phabricator.wikimedia.org/T174034) (owner: 10Paladox) [20:02:15] i confirmed on https://gerrit.git.wmflabs.org/r/c/Scripts/+/937/ [20:02:18] that it should work. [20:02:39] i merged it on the master, i'll let you apply on cobalt [20:02:47] 10Operations: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651 (10BBlack) https://github.com/vincentbernat/ip_vs_mh is available as a backport, or will be in 4.18 upstream releases. [20:03:31] !log running puppet on cobalt for gerrit notedb changes [20:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:55] hi! I am idling around, ping me as needed though I will probably not be super responsive. [20:04:21] jouncebot: now [20:04:21] For the next 0 hour(s) and 55 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [20:04:21] For the next 0 hour(s) and 55 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [20:05:26] ok, I've run puppet, the notedb changes are in gerrit.config, I've removed notedb.config [20:06:00] :) [20:06:11] !log restart gerrit for notedb config update [20:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:40] it came back :) [20:08:50] and I can still login [20:09:11] :) [20:09:14] i just logged in freshly [20:09:18] paladox: kudos [20:09:21] :) [20:10:04] (03PS10) 10Thcipriani: Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [20:10:05] woo! thanks everyone [20:10:12] :) we have even more stuff but maybe we should not jinx it and stop now. hehe [20:10:21] agreed. 
[20:10:23] :) [20:11:16] !log gerrit upgrade to 2.15.3 complete [20:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:19] well done, folks! [20:11:29] AaronSchulz: gerrit upgrade complete [20:11:42] ejegg: thanks :) [20:11:50] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] [20:11:58] good [20:12:29] thcipriani: am I free to sync something now :D [20:12:35] !log set merge.policy.reclaim_deletes_weight back to 2.0 for wikidatawiki_content [20:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:38] addshore: all yours [20:12:42] (03PS1) 10Dbarratt: Enable Special:Block Feedback Request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 [20:12:42] that kind of icinga alert happens .. no worries [20:12:52] it's just that puppet tried to git pull during gerrit restart [20:13:10] right, +2ed https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/448113/ and will wait for it to merge [20:13:11] runs it on thorium [20:13:12] (03PS2) 10Dbarratt: Enable Special:Block Feedback Request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 [20:15:05] thcipriani or twentyafterfour wondering if you could review https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/439503 please? :) [20:15:36] i have just tested and works alongside the polygerrit plugin (we made). But i found there's gerrit-theme.html [20:15:47] also adds CoC and privacy policy to the footer [20:16:29] cool, I'll check it out [20:16:33] (03CR) 1020after4: [C: 031] Add gerrit-theme.html and also add footer links [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [20:16:48] thanks! 
[20:16:49] paladox: lgtm [20:16:52] :) [20:17:00] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:17:05] i have it running @ https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/439503?polygerrit=1 [20:17:13] uh [20:17:19] https://gerrit.git.wmflabs.org/r/c/test/+/3?polygerrit=1 [20:17:22] that one ^^ [20:17:28] (footer) [20:19:10] paladox: it doesn't quite look right on https://gerrit.git.wmflabs.org/r/#/q/status:open [20:19:17] ah [20:19:20] you're using gwtui [20:19:22] the privacy policy doesn't get a | to the right of it [20:19:26] (03PS2) 10Dzahn: backup: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448083 [20:19:27] that is another test i am doing [20:19:39] https://gerrit.wikimedia.org/r/439503 is the polygerrit version [20:19:47] gwtui does not make this easy to implement :) [20:20:14] if you click on the switch to new ui [20:20:16] gwt doesn't make much easy, does it? [20:20:19] on the future [20:20:27] twentyafterfour the way it generates its html [20:20:38] and how it loads the file that we use to change things [20:22:10] twentyafterfour this is what it looks like in pg https://phabricator.wikimedia.org/F24164205 [20:22:23] google hacked their version compared to me creating a plugin for this :) [20:24:33] * mutante awards token 'less-hacky-than-google' :) [20:25:33] heh [20:27:29] (03CR) 10Dzahn: [C: 032] backup: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448083 (owner: 10Dzahn) [20:30:30] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0 [20:30:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0 [20:32:02] (03PS1) 10Mobrovac: EventStreams: Use the default log level (warn) [puppet] - 
10https://gerrit.wikimedia.org/r/448152 [20:39:50] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php: Use getClientLockName value for releaseClientLock when dispatching T200420 (duration: 00m 57s) [20:39:53] finally [20:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:55] T200420: Wikidata dispatching stuck (not releasing lockmanager locks) - https://phabricator.wikimedia.org/T200420 [20:43:43] addshore: done with the deploy? [20:43:50] yup! [20:43:53] * addshore is all done [20:44:21] cool, I'll do the morning SWAT then [20:46:15] (03PS3) 10Gergő Tisza: Do not set deprecated value for $wgExternalDiffEngine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445128 [20:46:25] jouncebot: now [20:46:25] For the next 0 hour(s) and 13 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [20:46:25] For the next 0 hour(s) and 13 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T1900) [20:48:04] (03CR) 10Gergő Tisza: [C: 032] Do not set deprecated value for $wgExternalDiffEngine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445128 (owner: 10Gergő Tisza) [20:49:15] (03Merged) 10jenkins-bot: Do not set deprecated value for $wgExternalDiffEngine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445128 (owner: 10Gergő Tisza) [20:49:28] (03CR) 10jenkins-bot: Do not set deprecated value for $wgExternalDiffEngine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445128 (owner: 10Gergő Tisza) [20:49:54] tgr: can you do https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/448107/ while at it? 
[20:50:41] AaronSchulz: sure [20:52:23] (03PS7) 10Gergő Tisza: Add interface-admin to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421122 (https://phabricator.wikimedia.org/T190015) [20:54:55] (03PS4) 10Dzahn: postgresql: add defined type to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [20:55:04] (03CR) 10Dzahn: postgresql: add defined type to create db backups (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [20:55:51] !log tgr@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: Do not set deprecated value for $wgExternalDiffEngine (duration: 00m 54s) [20:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:01] (03PS8) 10Gergő Tisza: Temporarily preserve sysops' JS editing ability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) [20:56:28] (03CR) 10Gergő Tisza: [C: 032] Add interface-admin to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421122 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [20:56:49] tgr: thanks [20:58:02] (03CR) 10Gergő Tisza: [C: 032] Temporarily preserve sysops' JS editing ability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [20:58:22] (03Merged) 10jenkins-bot: Add interface-admin to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421122 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [20:58:52] (03CR) 10jenkins-bot: Add interface-admin to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421122 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [20:59:07] (03Merged) 10jenkins-bot: Temporarily preserve sysops' JS editing ability [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [20:59:22] (03PS4) 10Dzahn: netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) [21:01:44] (03PS5) 10Dzahn: netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) [21:02:18] (03PS5) 10Gergő Tisza: Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) [21:03:14] (03CR) 10jenkins-bot: Temporarily preserve sysops' JS editing ability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [21:10:01] (03CR) 10Gergő Tisza: "This adds a bunch of untranslated (nonexistent) rights to lists like Special:ListGroupRights. Would have been better to SWAT it just befor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421123 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [21:10:35] (03CR) 10Dzahn: netbox: add psql dump cron and back it up (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [21:11:00] (03CR) 10Gergő Tisza: [C: 032] Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) (owner: 10Gergő Tisza) [21:11:10] !log tgr@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: T190015 gerrit 421122 + 421123 prepare for interface-admin group (duration: 00m 55s) [21:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:10] (03Merged) 10jenkins-bot: Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) (owner: 10Gergő 
Tisza) [21:12:27] (03CR) 10jenkins-bot: Update custom domain for thumbnail rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448021 (https://phabricator.wikimedia.org/T200346) (owner: 10Gergő Tisza) [21:12:40] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:12:52] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T190015 gerrit 421122 + 421123 prepare for interface-admin group (duration: 00m 54s) [21:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:18] wait, wait.. looking at the helium puppet issue [21:14:22] i was working on that [21:15:27] (03PS1) 10Urbanecm: Update hewikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448154 (https://phabricator.wikimedia.org/T200296) [21:17:41] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:18:50] jouncebot: next [21:18:50] In 1 hour(s) and 41 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T2300) [21:18:57] (03CR) 10Dzahn: [C: 032] deployment_server: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447939 (owner: 10Dzahn) [21:19:04] (03PS3) 10Dzahn: deployment_server: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447939 [21:19:41] has grafana always displayed in the user timezone and not UTC? 
O_o [21:19:43] !log tgr@deploy1001 Synchronized wmf-config/ProductionServices.php: SWAT: T200346 Update custom domain for thumbnail rendering (duration: 00m 55s) [21:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:47] T200346: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 [21:21:14] !log tgr@deploy1001 Synchronized wmf-config/LabsServices.php: SWAT: T200346 Update custom domain for thumbnail rendering (duration: 00m 54s) [21:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:49] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: T200346 Update custom domain for thumbnail rendering (duration: 00m 54s) [21:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:02] (03PS2) 10Dzahn: debug_proxy: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447940 [21:32:51] (03PS1) 10Dzahn: wdqs/labs: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448157 [21:32:53] (03CR) 10Dzahn: [C: 032] debug_proxy: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447940 (owner: 10Dzahn) [21:32:59] !log tgr@deploy1001 Synchronized php-1.32.0-wmf.14/includes/Title.php: SWAT T200456: Handle $title === null in Title::newFromText (duration: 00m 57s) [21:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:04] T200456: MapCacheLRU::has called with invalid key. Must be string or integer. - https://phabricator.wikimedia.org/T200456 [21:34:00] (03CR) 10Volans: "Thanks a lot for the review @gehel! 
I'll probably do the changes on Monday, but I wanted to answer your comments and let the discussion co" (0316 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [21:34:28] (03PS1) 10Dzahn: proton: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448158 [21:34:38] (03PS2) 10Dzahn: osm: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/447941 [21:36:35] Request from 71.202.57.167 via cp4029 cp4029, Varnish XID 31195163 [21:36:35] Error: 503, Backend fetch failed at Thu, 26 Jul 2018 21:36:24 GMT [21:36:51] works now [21:37:34] (03PS2) 10Dzahn: wdqs/labs: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448157 [21:39:20] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:40:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:41:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:41:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:41:41] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [21:42:09] yeah I think that's what I'm running into ^^ [21:42:50] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:43:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:49:50] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [21:50:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:50:51] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:52:56] (03PS2) 10Aaron Schulz: Revert "Revert "Make all non-test wikis write to both nutcracker and mcrouter again"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) [22:00:24] (03CR) 10Aaron Schulz: [C: 031] services: Convert ProductionServices.php to static array file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443874 (owner: 10Krinkle) [22:07:57] jynus: Hi! 
I was wondering, if JADE is adapted to the single-wiki alternative implementation (like wikidata, a single repository of judgments), then we might be able to deploy on the x1 timetable, right? [22:12:51] (03PS4) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [22:13:25] (03CR) 10jerkins-bot: [V: 04-1] Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [22:15:26] (03PS5) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [22:16:43] (03CR) 10Ayounsi: [C: 032] Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [22:20:31] jynus: Regardless of the timetable part, I’m thinking it’s the best of the compromises. Curious to hear your thoughts! [22:28:24] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10dr0ptp4kt) What's needed? [22:42:05] (03PS3) 10Krinkle: Make all wikis write to both nutcracker and mcrouter (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [22:42:33] (03CR) 10Krinkle: "I vaguely recall that we may need to do something here for wikitech/labswiki. Can you double-check?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [22:45:38] (03PS4) 10EBernhardson: Delete unused code in elasticsearch module [puppet] - 10https://gerrit.wikimedia.org/r/445320 [22:45:40] (03PS5) 10EBernhardson: Split elasticsearch::log::hot_threads into two pieces [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) [22:45:42] (03PS9) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [22:45:44] (03PS7) 10EBernhardson: Split per-cluster config out of elasticsearch::curator [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) [22:45:46] (03PS15) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [22:45:48] (03PS2) 10EBernhardson: Remove support for elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/447564 [22:45:50] (03PS8) 10EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) [22:45:52] (03PS28) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [22:45:54] (03PS28) 10EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - 10https://gerrit.wikimedia.org/r/441894 (https://phabricator.wikimedia.org/T198351) [22:45:56] (03PS30) 10EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - 10https://gerrit.wikimedia.org/r/441321 (https://phabricator.wikimedia.org/T198351) [22:50:40] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10ayounsi) 
[22:55:46] (03PS1) 10Dzahn: admins: update NDA/MOU expiry date for 4 users [puppet] - 10https://gerrit.wikimedia.org/r/448165 [22:57:21] (03PS1) 10Ayounsi: Depool ulsfo, outages on both redundant links to the site [dns] - 10https://gerrit.wikimedia.org/r/448166 [22:57:49] (03CR) 10Ayounsi: [C: 032] Depool ulsfo, outages on both redundant links to the site [dns] - 10https://gerrit.wikimedia.org/r/448166 (owner: 10Ayounsi) [22:59:14] (03PS1) 10Ayounsi: Depool ulsfo, outages on both redundant links to the site [dns] - 10https://gerrit.wikimedia.org/r/448167 [22:59:27] (03CR) 10Ayounsi: [C: 032] Depool ulsfo, outages on both redundant links to the site [dns] - 10https://gerrit.wikimedia.org/r/448167 (owner: 10Ayounsi) [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180726T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:04:09] !log depool ulsfo, outages on both physical links to the site [23:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:00] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. 
[23:11:26] (03CR) 10BryanDavis: "> I've poked the cloud team to see if there is any way to see if" [puppet] - 10https://gerrit.wikimedia.org/r/447564 (owner: 10EBernhardson) [23:29:05] (03PS4) 10Gergő Tisza: Configure group management for interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440676 [23:29:07] (03PS9) 10Gergő Tisza: Remove sitewide and user CSS/JS editing from old groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421124 (https://phabricator.wikimedia.org/T190015) [23:29:09] (03PS10) 10Gergő Tisza: Enforce that interface-admin is the only group that can edit non-own CSS/JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421125 (https://phabricator.wikimedia.org/T190015) [23:30:52] (03PS4) 10Aaron Schulz: Make all wikis write to both nutcracker and mcrouter (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) [23:31:02] Krinkle: I added an exception [23:40:00] (03CR) 10BBlack: [C: 031] "Let's try this next week, assuming ulsfo's gets a link back online by then :)" [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [23:43:16] (03PS2) 10Brion VIBBER: Switch in WebM VP9/Opus video transcodes to replace WebM VP8/Vorbis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447572 (https://phabricator.wikimedia.org/T63805) [23:47:39] (03CR) 10BBlack: [C: 031] Facter: add a v4 and v6 default routes fact [puppet] - 10https://gerrit.wikimedia.org/r/437771 (owner: 10Ayounsi) [23:51:16] (03CR) 10Dzahn: [C: 032] admins: update NDA/MOU expiry date for 4 users [puppet] - 10https://gerrit.wikimedia.org/r/448165 (owner: 10Dzahn) [23:56:16] (03CR) 10Krinkle: [C: 031] Make all wikis write to both nutcracker and mcrouter (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [23:58:57] (03CR) 10BryanDavis: [C: 031] "> Do you know how to kick 
off a test message manually?" [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron)