[02:15:47] DBA, Analytics, Analytics-EventLogging, ImageMetrics, Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3116785 (Nuria) Ping @jcrespo @Jdforrester-WMF will be adding these schemas to blacklist , so it...
[02:17:50] DBA, Analytics, Analytics-EventLogging, ImageMetrics, Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3116787 (Nuria) cc @Marostegui
[02:49:23] DBA, Analytics, Analytics-EventLogging, ImageMetrics, Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3116826 (Jdforrester-WMF) Sounds good. Thank you!
[06:51:06] DBA, Analytics, Analytics-EventLogging, ImageMetrics, Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3116946 (Marostegui) Sounds good! Thank you @Nuria once it is merged I will drop it again and se...
[06:53:50] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3116947 (Marostegui) db2051 and dbstore1001 are done ``` root@neo...
[06:54:12] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3116948 (Marostegui) db2051 and dbstore1001 are done ``` root@neodymium:~# for i in db2051.codfw.wmnet dbstore1001.eqiad....
[07:33:18] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3117008 (Marostegui) >>! In T160454#3114957, @Nuria wrote: > @Marostegui I think @otto can run script in our end, let us know if that is OK with you and we will take a s...
[07:45:13] DBA, Wikidata: Wikibase\Repo\Store\Sql\SqlEntitiesWithoutTermFinder::getEntitiesWithoutTerm can take 19 hours to execute and it is run by the web requests user - https://phabricator.wikimedia.org/T160887#3113697 (Marostegui) I am afraid this not only runs on vslow slaves. db1070 is not receiving vslow tr...
[07:46:21] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3117015 (Marostegui) I see the server is still down, @Papaul did the technician finally show up yesterday?
[09:24:01] DBA, Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3117158 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1092.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimag...
[09:36:11] DBA, Expiring-Watchlist-Items, MediaWiki-Watchlist, TCB-Team, and 3 others: Add wl_timestamp to the watchlist table - https://phabricator.wikimedia.org/T125991#3117206 (Addshore) Open>stalled
[09:41:41] DBA-ping: I have a couple of questions when you'll be around :)
[09:45:56] DBA, Patch-For-Review: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3117238 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1092.eqiad.wmnet'] ``` and were **ALL** successful.
[09:49:52] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3117242 (jcrespo)
[10:02:56] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3117278 (jcrespo)
[10:32:55] DBA: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3117405 (Marostegui) jawiki has been finished and there are no differences! \o/
[11:35:18] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3117514 (Marostegui) db2044 and labsdb1009 are done: ``` root@neo...
[11:35:52] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117515 (Marostegui) db2044 and labsdb1009 are done: ``` root@neodymium:/home/marostegui/databases_s6# for i in db2044.co...
[11:48:06] DBA, Labs: Prepare and check storage layer for kbp.wikipedia.org - https://phabricator.wikimedia.org/T160869#3117533 (Dereckson) p:Triage>Low
[11:59:38] jynus, marostegui: DBA-ping :) I have a couple of questions when you've a minute
[12:24:58] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117611 (Marostegui) I am investigating why labsdb1009's replication is complaining about this ``` Last_SQ...
[12:28:29] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117625 (jcrespo) Oh, yes, that is binlog row being pedantic. There may be a temporary way to disable that behaviour.
[12:33:48] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117632 (Marostegui) This might be the thing to play with: https://dev.mysql.com/doc/mysql-replication-excerpt/5.7/en/rep...
[12:38:02] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117641 (jcrespo) Yes, that looks like it- I would do a quick test to confirm it works well- then apply it (either tempor...
[12:39:51] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117642 (jcrespo) It could also be our fault- the mime may be only internally a varchar, and we may be explicitly convert...
[13:07:05] DBA: Check OATHAuth tables are in each private or fishbowl wiki - https://phabricator.wikimedia.org/T160991#3117676 (Dereckson)
[13:09:06] DBA: Check OATHAuth tables are in each private or fishbowl wiki - https://phabricator.wikimedia.org/T160991#3117689 (Dereckson)
[13:56:01] DBA, Wikimedia-Site-requests, Patch-For-Review: Recreate a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#3117783 (Aklapper) >>! In T126832#3057932, @Dereckson wrote: > I ask greg a green light for a window Tuesday 11am CET? Any new deployment window on the horizon? :)
[13:58:40] DBA, Wikimedia-Site-requests, Patch-For-Review: Recreate a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#3117787 (Dereckson) @greg-g @jcrespo I suggest the following approach: we note it at the top of the calendar for the 27th week, so it's announced, and we can try to sched...
[14:04:59] DBA: Check OATHAuth tables are in each private or fishbowl wiki - https://phabricator.wikimedia.org/T160991#3117676 (Reedy) When was projectcomwiki created? AFAIK, I created the tables on all private wikis when I was deploying it in response to the hacking attempts
[14:07:11] DBA: Check OATHAuth tables are in each private or fishbowl wiki - https://phabricator.wikimedia.org/T160991#3117816 (Reedy) Open>Resolved a:Reedy Yeah, they exist on all private and fishbowl wikis
[14:08:01] DBA: Check OATHAuth tables are in each private or fishbowl wiki - https://phabricator.wikimedia.org/T160991#3117821 (Dereckson) In August. Thanks to have checked.
[14:11:18] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3117849 (Anomie)
[14:11:38] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3117242 (Anomie) > (is is ok to add an index just for a single, infrequent query?) As long as it's ok with the DBAs, sure. ;) > Rewrite the query (how?) The query here is asking "If...
[14:34:25] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117922 (Marostegui) I have reproduced the error on a test vm and looks like the following mode fixes it. And the informa...
[14:34:56] DBA, Wikimedia-Site-requests, Patch-For-Review: Recreate a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832#3117923 (jcrespo) As I said, I am ready. Send me a calendar invite or ping me well in advance so I can do the DB stuff.
[14:36:14] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3117924 (jcrespo) +1 to deploy it.
[15:04:31] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3117977 (jcrespo) > As long as it's ok with the DBAs, sure. ;) The DBAs are probably not sure if it is worth it. What it is gained and lost directly with every index not only is diffi...
[15:25:08] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3118049 (Nuria) Ok, will be sending e-mail today with the change we are going and we can plan on doing it Thursday (48 hrs notice)
[15:39:59] DBA, Analytics, Analytics-EventLogging, ImageMetrics, Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3118069 (Nuria) We have merged change and restarted processors, tables can be deleted (cc @Maros...
[15:47:45] DBA, Analytics, Analytics-EventLogging, ImageMetrics, Patch-For-Review: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3118102 (Marostegui) >>! In T141407#3118069, @Nuria wrote: > We have merged change and restarted...
[15:52:53] volans, ask
[15:55:20] jynus: hey, so first quick one, I guess this step is not needed anymore: (days in advance) Warm up databases; see MariaDB/buffer_pool_dump.
[15:55:30] linked to https://wikitech.wikimedia.org/wiki/MariaDB/buffer_pool_dump
[16:00:04] volans, it is
[16:00:09] in fact we talked about that
[16:01:10] we have the suspicion that maybe not all parts of the incidents will be solved by the lego script
[16:01:21] redit/memcache/apc yes
[16:01:33] but external storage maybe still too cold
[16:01:50] specially given that es2* hosts have crashed multiple times
[16:01:59] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3118132 (Anomie) Some data: Assuming I did the hive query correctly, there seem to have been 10419936343 API requests so far this month. 1...
[16:02:11] we want to warmup them for the largest wikis
[16:02:54] but that is not a question O:-)
[16:02:57] yep, and then we will be able to also see the non warmed ones will fail, so we can be sure whether we need the warm up for future switches or not
[16:03:34] it was probably worse when they never have been used before, but still
[16:03:47] es databases can get very cold
[16:03:55] for older data
[16:04:35] anything that could/should be done as part of the automation of the switchover there?
[16:04:43] not really
[16:04:54] we could write some scripts
[16:04:58] * volans would like mysql to allow to dump the buffer pool from a hot host and reload it in a cold one
[16:05:16] but it is a mostly heuristic issues at the time
[16:05:19] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3118144 (Anomie) >>! In T160983#3117977, @jcrespo wrote: > In an ideal world, we would have a redirects table- it could even be just a pointer to the page table. That could help other...
[16:05:32] we will do something about it, but we do not have metrics about what exactly is needed
[16:05:44] we will get metrics when it happens- we didn't gather last time
[16:16:21] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3118186 (jcrespo) That is really nice data! I was thinking of maybe doing dynamic timeout: if there is low concurrency, allow most of them...
[16:17:06] volans, was there a second question?
[16:19:03] jynus: yes, got sidetracked
[16:19:37] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3118203 (jcrespo) Nice, I didn't realize that! Still you wouldn't know how to rewrite the query in the case of is_redirect=1 :-P ? Index on rd_namespace,rd_title seems useful :-)
[16:20:15] the way the semi-sync replication is setup, is defined also by $::mw_primary
[16:21:27] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3118204 (jcrespo) Actually, that wouldn't work, it is the target that is indexed, not the origin. We would need a different redirect table....
[16:21:48] in mariadb/core.pp and I wanted to know if: 1) should I automate also the change of it during the switchover or it will be done later by you 2) if I should force run puppet to have the disk config in sync (I guess the answer is no for 2)
[16:21:55] what?
[16:23:07] I think that is wrong
[16:23:10] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3118206 (Anomie) rd_from is the pointer to the page table. The rest of the columns are about the target of the redirect.
[16:23:16] and I think it was you that left it like that
[16:23:58] yeah I remember touching it
[16:24:03] it should be $shard == 'es1' off, master=master, slave=slave
[16:24:16] semisync enabled all the time
[16:24:18] # Semi-sync replication
[16:24:18] # off: for non-primary datacenter and read-only shard(s)
[16:24:18] # slave: for slaves in the primary datacenter
[16:24:18] # master: for masters in the primary datacenter
[16:24:31] so no change is needed
[16:25:20] no dependency of $::mw_primary
[16:25:40] but right now it is like that in puppet
[16:25:54] so I will fix it, and you can forget about that
[16:26:03] sounds great! :D
[16:26:22] see? fixing the problems you caused a year later still!
[16:26:27] :-))))))))
[16:26:33] rotfl!
[16:27:08] don't worry about that, it is not a blocker
[16:27:22] but thanks for reviewing it
[16:27:45] I would bet that semisync is still active on most dbs still
[16:27:53] on codfw
[16:28:49] it was https://gerrit.wikimedia.org/r/#/c/285649 :D
[16:31:25] maybe it was complex to know if a server was master and slave at the same time?
[16:31:57] because masters in reality would be "both" rather than master?
[16:32:15] I do not know, but I do not see a reason not to enable it all the time on both dcs
[16:32:25] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3117278 (Marostegui) >>! In T160984#3118186, @jcrespo wrote: > That is really nice data! I was thinking of maybe doing dynamic timeout: if...
[16:35:22] agree
[16:36:00] volans, If you have time, file a ticket as a subtask of the goal and I will probably get it done this week
[16:36:11] sure, thanks
[16:38:51] DBA, Operations, DC-Switchover-Prep-Q3-2016-17, Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3118247 (Volans)
[16:38:55] done ^^^
[16:41:26] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3118261 (jcrespo) > Maybe that can get too complicated. Not necessarily, it is just a few extra queries on: https://phabricator.wikimedi...
[16:46:26] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3118282 (Marostegui) >>! In T160984#3118261, @jcrespo wrote: > Alternatively, we get rid of the events and puppetize pt-kill properly. Yo...
[16:48:00] DBA, Operations, DC-Switchover-Prep-Q3-2016-17, Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3118288 (jcrespo) This should not take much time, setting up next on the pipeline.
[17:43:56] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118453 (Papaul) a:Marostegui Main board and CPU2 replacement complete. System back up online.
[17:44:24] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118461 (Marostegui) Thanks Papaul! I will take it from here
[17:45:03] marostegui, maybe you can finish your day and I can take it?
[17:45:40] jynus: No worries, I am planning to stick around a bit more :), but thanks for taking care of me :)
[17:46:03] marostegui, can you invite faidon to the next dba meeting?
[17:46:10] yep, will do
[17:46:25] I can't promise that I will make all of them, but I can make this one
[17:46:27] I wanted to discuss a couple of things and he proposed to join us there
[17:46:55] done!
[17:46:58] I don't want to distract you from your regular agenda either, we can have a separate one if you prefer that
[17:47:18] paravoid: no worries, it is normally an open agenda :)
[17:47:51] marostegui, you mean 30 minutes me ranting to you? very open!
[17:48:00] hahahaha
[17:55:41] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118507 (Marostegui) @Papaul can you check if the network cable is plugged? The system doesn't have network. I ran: ``` root@es2015:~# rm -fr /etc/udev/rules.d/70-persistent-net.rules ``` As it is an...
[17:58:40] dbstore2001 is getting a bit behind with some purges- I will leave it like that unless it gets worse tomorrow
[17:59:09] I saw it was running quite some heavy deletes in the morning
[17:59:12] when I saw the warning in icinga
[18:01:54] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3118566 (leila) @Ottomata do you know if databases in staging in analytics-slave will be copied to some other place if we're decommissioning the suggested machines? I'm aski...
[18:04:25] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3118585 (Ottomata) @leila, we can dump and copy to analytics-store, as long as there aren't any database.table name collisions.
[18:32:38] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3118697 (leila) sounds good, @Ottomata.
[18:52:36] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118791 (jcrespo) es2015.codfw.wmnet needs a mysql_upgrade run before restarting replication. BTW, I fixed some things on the new package: now you can run mysql_upgrade correctly on the path.
[19:18:49] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118898 (jcrespo) I have restarted it myself- its ping returned- I assume Papaul did something?
[19:24:32] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3118936 (Marostegui) Papaul just replugged the cable and it works now: ``` root@es2015:~# mii-tool eth0 eth0: negotiated, link ok ``` Thanks @Papaul!
[19:34:12] DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3119018 (jcrespo)
[19:34:35] jynus: did you do something to es2015?
[19:34:53] DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3119031 (jcrespo) I have tried running analyze table on wikidatawiki, with no changes.
[19:35:07] I just got it on the install menu???
[19:35:58] Not sure if it is too late or it has already wiped the disks
[19:36:02] DBA: dbstore2001 creates bad query plans for wikidata.wb_changes - https://phabricator.wikimedia.org/T161024#3119032 (jcrespo) The structure and indexes are the same, with the only exception being the compression.
[19:38:20] marostegui, what?
[19:39:26] yeah
[19:39:27] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3119071 (Marostegui) I am not sure what has happened, but something weird and maybe we have lost its data. The server got rebooted itself while I was on it and started to run PXE boot and started the...
[19:39:28] check my update
[19:39:49] It is booting via PXE
[19:40:17] And launching the installation, I have stopped it but not sure if I stopped it on time or it already wiped the disks
[19:40:36] sometimes it boots on pxe if the disks are not detected
[19:40:43] so the installation fails anyway
[19:40:51] but why did it reboot in the first place?
[19:41:17] I rebooted it to make sure the link would come back online as it normally would
[19:41:30] And it was taking too long and I logged via idrac
[19:41:33] to see why
[19:41:34] and saw that
[19:43:07] the server was a bit weird, NTP was wrong (as it didn't have network)
[19:43:20] so I decided to run a reboot to make sure it would reboot finely and in a normal state
[19:43:35] but it didn't
[19:45:45] it is always booting up via PXE
[19:45:59] So there is something wrong
[19:46:03] yeah, it may have issues with the disks
[19:46:15] can you go into the bios/boot menu?
[19:46:24] it may not be detecting the disks
[19:46:36] yeah, I am trying, but the idrac latency is crap
[19:46:39] try forcing it and it will likely fail
[19:46:45] so it always fails on me
[19:46:48] and that would mean nothing was lost
[19:50:16] I have been able to get on the controller menu, and the disks are there, the bios is being a bit harder to log-in to
[19:51:42] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3119113 (Tbayer) >>! In T160454#3118049, @Nuria wrote: > Ok, will be sending e-mail today with the change we are going and we can plan on doing it Thursday (48 hrs notic...
[19:52:59] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3119115 (Nuria) >BTW, I assume we are going to update https://meta.wikimedia.org/wiki/Schema:EventCapsule beforehand too? We will be doing that once this work is finishe...
[19:53:52] ok, it is now trying to boot from disk
[19:54:22] we will see what it does
[20:00:44] so?
[20:01:09] nope
[20:01:14] Booting from Hard drive C:
[20:01:20] that is what it says
[20:01:28] and no movement
[20:01:36] the raid is being seen from the BIOS though
[20:01:47] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3119152 (Marostegui) I forced it to boot from disk, but it is not booting. The RAID looks healthy from the raid controller (and bios) raid menu, the virtual disk is there. But the server isn't booting...
[20:02:09] ok, put it down and we will solve it tomorrow
[20:02:38] yeah, probably the best idea...
[20:03:06] at least it is not that big what we'd need to copy to it, 4TB
[20:05:57] server is down
[20:06:05] I am going to eat some dinner and rest, it has been a long day
[20:07:11] but I think the disks are wiped, because I got a shell when aborted the installation. it is a ram disk, and i wasn't able to mount anything there
[20:08:12] let's discuss tomorrow
[20:27:33] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3117278 (Tgr) Is it a good idea to attempt the handling of such problems in the DB layer, where not a lot of origin information is availab...
[20:40:43] DBA, MediaWiki-Database: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983#3117242 (Tgr) > But even with the best plan, this query would still be problematic when run on a namespace with millions of redirects and few non-redirects. It seems very unlikely tha...
[20:58:12] jynus: Fyi, in case you see the DBQueryErrors coming from mw.org / Flow -- we spotted it already, rolling back + filing task
[20:58:31] Errors of the sort: Error: 1054 Unknown column 'workflow_last_update_timestamp' in 'order clause' (10.64.16.20)
[21:01:40] Curious. No recent db patches to the extension
[21:02:23] Yeah I didn't see either, but was pretty spammy on mw.org so reverting + filing task for now
[21:07:49] T161040
[21:07:50] T161040: Unknown column 'workflow_last_update_timestamp' in 'order clause' - https://phabricator.wikimedia.org/T161040
[21:08:48] RainbowSprinkles: It's missing a table join
[21:09:04] workflow_last_update_timestamp is in flow_workflow
[21:09:12] or, the col is just wrong
[21:09:15] flow_topic_list is being selected from
[21:38:57] DBA, Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3117278 (GWicke) Given that we limited API requests to 60s execution time until HHVM broke timeouts temporarily (see T97192), reducing the...
[21:43:56] Reedy: Yeah that bug has been around forever and we don't understand why
[21:44:17] RoanKattouw: Surely it's around because it's not been fixed?
[21:44:18] ;)
[21:44:33] Well yes ... but apparently it's a mystery even to people who know the Flow codebase well
[21:44:42] (Which tells you something about the understandability of that part of the code...)
[21:45:31] It does look a little bit too... abstracted
[21:47:55] RoanKattouw: Well I will say, it definitely spiked after the wmf.17 rollout
[21:48:06] It's been infrequent enough that I'd never seen the bug before
[21:49:34] Odd
[21:49:57] I was going to say, this bug isn't even remotely new, wasn't sure why you were reverting over it
[21:50:02] Strange that it suddenly spiked
[21:50:35] I'll try investigating the backtrace, others have tried and failed before but I've never taken a shot at it so who knows
[21:51:20] Well we hit a few other things that are new & mildly worrying too, so rollback was best action anyway
[21:51:31] But yeah, definitely showed up in a large batch when we moved to wmf.17 on mw.org
[21:59:00] array(
[21:59:01] 'sort' => 'workflow_last_update_timestamp',
[21:59:01] 'order' => 'desc'
[21:59:01] )
[21:59:13] line 552 of container.php looks suspicious
[22:02:50] Similar one in purgeaction, but doesn't seem so related
[22:09:26] Will look there
[22:09:41] It seems from my testing so far that it's only if &vtloffset-dir=rev is passed in
[22:09:43] fwd works
[22:10:27] o_0
[22:24:29] Curiously, errors all disappeared when going back to wmf.16
[22:42:37] I think that could possibly be coincidence, due to some activity on Extension talk:MobileFrontend. I can still trigger the error by going to https://www.mediawiki.org/w/api.php?action=flow&format=json&page=Extension_talk%3AMobileFrontend&submodule=view-topiclist&vtloffset-dir=rev&vtllimit=10&vtloffset=20170104171433&vtlsortby=updated
[22:46:23] Is it just that page, or does the same behavior occur on other Flow pages?
[22:47:18] Nope, all pages
[22:47:26] Cf: https://www.mediawiki.org/w/api.php?action=flow&format=json&page=Talk:MediaWiki&submodule=view-topiclist&vtloffset-dir=rev&vtllimit=10&vtloffset=20170104171433&vtlsortby=updated
[22:52:23] All pages yes
[22:52:43] But in practice, in order to trigger it from the UI, you need to have a page with many topics and be notified about one that's not in the top 10
[22:53:13] Reedy's pointer to line 552 may be relevant; I'm still trying to understand the code path here, it's highly abstracted and very convoluted
[22:53:32] offset-dir=fwd doesn't produce a similar query for reasons I don't understand yet
[22:54:41] if you can get it somewhere easily replicable... You can just change line 552 to something completely stupid and see if it changes in the output query
[22:58:24] Yeah good idea
[22:59:37] Aha OK that doesn't reflect in the query but it changes the error message substantially, which means you're on the right path there
[22:59:38] Exception caught: No index (out of 3) available to answer query for topic_list_id with options
[22:59:58] However the options were {\"limit\":10,\"sort\":[\"workflow_last_update_timestamp\"],\"order\":\"desc\"}
[23:00:06] So I need to tell something somewhere to join against the workflow table
[23:01:04] so it's looking it up in some other array then
[23:18:00] If you look at getIndexFor it's iterating over all those indexes
[23:18:21] So... I presume for the non rev version, one of the other indexes is answering the query
[23:18:33] Yeah
[23:18:35] vtloffset-dir=rev causes it not to be resolved
[23:18:44] The strange thing is, that index is hooked up in a way that does the join
[23:18:47] So it gets onto that TopKIndex
[23:18:52] But somehow the base class ends up being called, not the specific class
[23:19:14] public function canAnswer( array $keys, array $options ) {
[23:19:14] if ( !parent::canAnswer( $keys, $options ) ) {
[23:19:14] return false;
[23:19:14] }
[23:19:16] $c['storage.topic_list.indexes.last_updated.backend'] is an instance of TopicListLastUpdatedStorage which overrides find()
[23:26:54] jynus: https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20170320-Special_AllPages&diff=1748368&oldid=1745046 btw.
[23:27:11] (Of note: I changed scratching to scraping.)
[23:27:20] I think that's what was intended, but if not, please revert.
[23:28:00] RoanKattouw: yeah, see what you mean. The * is in the parent... but not the one it should be running
[23:28:01] Though scratching is oddly appropriately and pretty cute.
[23:28:21] appropriate *
[23:28:51] RoanKattouw: Sure that mapping is definitely as you think it is? Not overridden somewhere
[23:29:29] Reedy: ebernhardson is helping me at my desk now, and I think we've found it
[23:30:53] Silly stupid fix?
[23:36:59] No, I was wrong, but we've at least found the commit that breaks it
[23:37:37] Link?
[23:37:39] /hash
[23:38:08] Commenting on the task
[23:44:52] Reedy: https://phabricator.wikimedia.org/T121644#3120326
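
Editor's note on the Flow index lookup discussed from 22:59 onward: the log describes getIndexFor iterating over the registered indexes, asking each one canAnswer( $keys, $options ), and failing with "No index (out of 3) available" when none accepts the query. The sketch below is a rough Python model of that selection loop only. The class, field, and index names (Index, get_index_for, "primary", "reverse_lookup", "last_updated") are hypothetical stand-ins, not Flow's real code, and the matching rule is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Index:
    """Hypothetical stand-in for a storage index; not Flow's real class."""
    name: str
    keys: tuple
    sort: tuple = ()

    def can_answer(self, keys, options):
        # Assumed rule: the indexed keys must match, and the requested
        # sort columns must be the ones this index is ordered by.
        if tuple(sorted(keys)) != tuple(sorted(self.keys)):
            return False
        return tuple(options.get("sort", ())) == self.sort


def get_index_for(indexes, keys, options):
    """Return the first index that can answer the query, else raise."""
    for index in indexes:
        if index.can_answer(keys, options):
            return index
    raise LookupError(
        f"No index (out of {len(indexes)}) available to answer query for "
        f"{', '.join(keys)} with options {options}"
    )


# Three illustrative indexes, loosely mirroring the "out of 3" in the log.
indexes = [
    Index("primary", ("topic_list_id",)),
    Index("reverse_lookup", ("topic_id",)),
    Index("last_updated", ("topic_list_id",),
          ("workflow_last_update_timestamp",)),
]
options = {"limit": 10, "sort": ["workflow_last_update_timestamp"],
           "order": "desc"}

# Only the index sorted by the requested column accepts this query.
print(get_index_for(indexes, ["topic_list_id"], options).name)
```

In this model, changing the requested sort column to a name no index knows makes every can_answer call return False, so the loop falls through to the LookupError. That corresponds to the experiment at 22:59, where editing line 552 changed the error message rather than the query.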