[05:25:45] DBA, Labs, Patch-For-Review: Make watchlist table available on labs - https://phabricator.wikimedia.org/T59617#2724516 (MZMcBride) >>! In T59617#2384976, @jcrespo wrote: > I have now the watchlist count generator running on all public wikis. Someone just asked me about generating a watchlist-related...
[08:31:47] DBA, Labs, Patch-For-Review: Make watchlist table available on labs - https://phabricator.wikimedia.org/T59617#2724627 (jcrespo) The tables actually are generated physically, but work is being done as of this moment by labs engineers to make those available on all wikis. Stay tuned. CC @chasemp
[08:34:12] I have acked, as per faidon request, warning adding https://phabricator.wikimedia.org/T148373 text
[08:41:52] I am going to provision db1053 using db1052, will perform maintenance and reboot on both
[08:47:43] I need to use db1064, too, for copying, please, marostegui ping me when you are done so I can use it before repooling
[08:49:53] jynus: sure
[08:50:07] I am going to wait a bit longer to make sure it is depooled
[08:50:13] I can take care of repooling myself
[08:50:16] as per your comment that it can take hours
[08:50:17] cool
[08:50:35] looks ok to me now
[08:50:48] it literally takes hours
[08:51:09] but it seems there is no ongoing vslow or dump activity at the moment
[08:51:42] https://tendril.wikimedia.org/activity?wikiuser=0&research=0&labsusers=0
[08:51:51] so go on with the alter
[08:52:03] I will go ahead then
[08:52:03] I will copy things after the alter
[08:52:13] ok, I will ping you once it is done
[08:52:16] you are not too careful :-)
[08:52:20] *now
[08:52:29] which is good
[08:55:06] haha
[08:55:09] I am still scared eh
[08:55:16] I guess less than the first week
[08:55:20] the alter is running ow
[08:55:21] now
[10:42:06] remember to force the upgrade even after the import tablespace
[10:42:24] I would not trust the server
[10:43:02] although it is surprisingly slow
[10:54:42] DBA, Labs, 
Labs-Infrastructure: Implement proxysql both for labs and for later production usage - https://phabricator.wikimedia.org/T148500#2724798 (jcrespo)
[10:55:18] DBA, Labs, Labs-Infrastructure: Implement proxysql both for labs and for later production usage - https://phabricator.wikimedia.org/T148500#2724813 (jcrespo) proxysql 1.2.4 is now available on the wikimedia repository (jessie only).
[10:59:22] :)
[11:01:12] https://gerrit.wikimedia.org/r/316541
[11:01:51] I do not know what to call the role, dbproxy (and possibly contain in the future proxysql and haproxy)?
[11:15:50] I would go for dbproxy indeed
[11:16:13] If at some point we believe proxysql will be what we will go for…then it will make sense to call it proxysql
[11:16:19] but for now we are only testing no?
[11:16:39] actually, I ended up with labs/db/proxy
[11:36:47] isn't the import tablespace taking a lot of time?
[11:37:51] The huge tables are
[11:37:57] I will give a summary in the ticket
[11:38:03] As I have seen something I do not like too much
[11:38:21] I am also doing this btw: https://wikitech.wikimedia.org/wiki/MariaDB#Importing_table_spaces_from_other_hosts_with_multi_source_replication
[11:38:43] I will update the ticket once the import is done (and I am not sure if it will be done with success)
[11:38:48] Stay tuned :p
[11:38:51] that is great, marostegui !
[11:38:54] really great!
[11:39:24] I am still writing some stuff, but thanks! 
:)
[11:39:45] as it is so large, maybe moving it to a subpage
[11:39:57] and just doing the summary on the main page
[11:40:46] Ah, that is a good idea :)
[11:40:53] I will do it
[11:40:55] To keep it clean
[11:41:08] well, it is not very clean now
[11:41:24] but I have been adding things on subpages myself, too: https://gerrit.wikimedia.org/r/316541
[11:41:26] ups
[11:41:34] like this: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:42:05] Yeah, it makes sense
[11:42:11] It is a lot easier to focus that way
[11:42:37] moving a summary to https://wikitech.wikimedia.org/wiki/MariaDB#Loading_Data
[11:44:02] "Be careful, running the ALTER TABLE, as is, takes Query OK, 629385781 rows affected (5 days 54 min 6.51 sec) for the English wikipedia"
[11:44:05] ^
[11:44:42] OMG
[11:44:49] Well, you saw my face yesterday during our hangouts
[11:44:50] hahaha
[11:44:55] I was like: he must be kidding
[11:46:01] I remember doing that while I was at the mysql conf last year
[11:46:19] Imagine you forget to run it in a screen
[11:46:19] XD
[11:46:32] well, I run it from localhost
[11:46:37] and at the same time
[11:46:44] there was a router failure
[11:47:03] which made us lose network
[11:47:24] XDDDDDDDD
[11:47:25] imagine if it had been reverted
[11:48:20] not sure if you are editing, let me give you some help with the formatting
[11:48:35] Please, feel free :)
[11:49:00] https://wikitech.wikimedia.org/w/index.php?title=MariaDB&type=revision&diff=909187&oldid=909172
[11:51:16] Ah, thanks - yeah missed that
[11:51:18] Thank you
[11:51:37] there are some small issues on s4-master
[11:52:05] checking
[11:52:26] mostly rpc-related
[11:53:23] Did you catch that in logstash?
[11:53:28] yes
[11:53:40] some "Lock wait timeout exceeded; try restarting transaction"
[11:53:44] not pathologic
[11:53:52] but it seems more loaded than usual, maybe
[11:54:14] Using the mediawiki and dbquery filters? 
[11:54:48] this is what I am looking at now: https://logstash.wikimedia.org/goto/c9cae07d9ef0cf3e9b194e2770cf550b
[11:55:22] if you click on dbserver, 10.64.16.29 is unusually high in percentage
[11:55:32] more if you change the -rpc for *
[11:56:08] I think it is the refreshLinks job
[11:56:18] Ah I see
[11:56:33] I always have the mediawiki and DBquery graph opened
[11:56:42] And didn't see anything too weird there
[11:56:43] the mediawiki is useless
[11:56:44] interesting
[11:56:52] too many different logs
[11:57:13] So you normally do not add it?
[11:57:18] I will kill it right now :)
[11:57:47] I monitor fatalmonitor
[11:57:53] after a deploy
[11:58:13] but there are a lot of things that do not go there, and lots of things that go there that are not useful
[11:59:07] Ah, fatalmonitor nice :)
[11:59:20] * marostegui trying to see matrix with jynus
[11:59:53] yes, it is just one year seeing what fails normally
[11:59:57] and what doesn't
[12:45:17] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Implement proxysql both for labs and for later production usage - https://phabricator.wikimedia.org/T148500#2724798 (jcrespo) a:jcrespo
[12:59:51] DBA, Operations, ops-codfw: es2014 raid controler temporary failure - https://phabricator.wikimedia.org/T148434#2725037 (elukey) p:Triage>High
[13:35:42] morning jynus! hey thanks for the note on role paths, do you think I should move modules/role/manifests/labsdb/views.pp to modules/role/manifests/labs/db/views.pp? 
[13:36:20] I think yuvipanda is the most vocal on that case, I would do what he suggests
[13:36:40] I do not care, as long as it is the same for all
[13:36:43] :-)
[13:37:54] BTW, mar*stegui has made huge progress on documenting the importing process: https://wikitech.wikimedia.org/wiki/MariaDB#Importing_table_spaces_from_other_hosts_with_multi_source_replication
[13:38:23] and I am now at full steam with proxy sql: https://gerrit.wikimedia.org/r/316541
[13:46:12] Morning chasemp
[13:46:33] jynus: I am now creating the subpage: https://wikitech.wikimedia.org/wiki/MariaDB/ImportTableSpace :)
[13:47:39] marostegui, as the alter is taking more than I thought, I want to deploy the new init.d everywhere
[13:47:56] if you are going to be around in the next 30 minutes
[13:48:19] I am going to be :)
[13:48:21] Go ahead
[13:48:34] it is https://gerrit.wikimedia.org/r/316332
[13:54:05] I would say we should start and stop some servers, but not sure which ones?
[13:54:17] jynus: codfw?
[13:54:22] yeah
[13:54:32] (please do not touch dbstore2001 if possible)
[13:54:48] nope
[13:54:59] something from s2-codfw
[13:55:09] and I test .27, too
[13:55:11] :-)
[13:55:15] hahah
[13:55:17] fine
[13:55:28] do you want me to help with the restarts
[13:55:33] ?
[13:55:42] I just want to test a couple of different services
[13:55:53] nothing massive
[13:57:14] db2049 and es2019, for example
[13:57:21] ok with those^ ?
[13:57:58] yep
[14:02:33] jynus: db1064 has finished the PK alter, so you are free to use it
[14:02:41] yay!
[14:02:52] I will take full ownership of it
[14:03:02] and repool it myself when I finish
[14:03:28] do you know how many s4 servers are left regarding revision?
[14:04:40] jynus: yes, 0 :)
[14:04:46] yay!
[14:04:51] I am happy today
[14:04:51] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2725200 (Marostegui) db1064 has finished the ALTER table: ``` *************************** 1. 
row *************************** Table: revision Create Table: CREATE TABLE `revision` ( `rev_id` int(8) unsi...
[14:04:58] lots of things done
[14:05:05] I am going to double check but I think we are all set for those
[14:05:13] jynus: Maybe db1069?
[14:05:31] that can be done live
[14:05:42] well, has to :-)
[14:05:58] Yep, I will do it then. (Because we want it to be done I assume?)
[14:06:15] yes
[14:06:19] check labs
[14:06:29] see how it is
[14:06:41] sure, will do too now
[14:06:55] we do not care much
[14:07:02] because of the new machines
[14:07:06] but at least know it
[14:10:32] We would need to do: labs, dbstores and db1069
[14:11:50] none of those need depooling
[14:12:02] I would do at least db1069
[14:12:09] the other are going to be substituted soon
[14:12:12] *s
[14:12:22] probably not worth it, but your call
[14:12:26] Awesome, I will do db1069
[14:12:29] The rest I won't do
[14:12:36] good
[14:13:12] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2725225 (Marostegui) All the core servers have been done. Pending servers are: ``` labsdb1001 labsdb1003 db1069 dbstore2002 dbstore1001 dbstore1002 ``` @jcrespo and myself have agreed on doing `db1069` only as...
[14:14:45] db1069 it tokudb
[14:14:48] *is
[14:14:54] it may take a day or more
[14:15:06] beware^
[14:21:57] oh that is true
[14:22:10] :_(
[14:49:00] DBA, Operations, ops-codfw: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2725318 (elukey) p:Triage>High
[15:04:05] jynus: I'm trying to grok this watchlist table ticket and I think I figured out the history and I'm wondering where this is https://phabricator.wikimedia.org/T138450#2401133
[15:04:10] the watchlist_count generator
[15:04:14] mainly for my own clarification
[15:04:26] so this will be a labs-only table
[15:04:40] which may sound stupid
[15:04:46] but it actually makes sense
[15:05:01] I meant where is the logic that generates that table? 
[15:05:05] we cannot share the watchlist, because it is personal data
[15:05:29] so we generate it on db1069
[15:05:58] so we have a watchlist table natively, and something on db1069 curates that into a watchlist_count table and we need a watchlist_count_p view
[15:06:10] yes
[15:06:41] where is the logic that curates that table into watchlist_count? I mean a script or a cron or something I can see to understand better
[15:24:14] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725515 (Papaul) a:Cmjohnson>Papaul
[15:32:28] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725636 (Papaul) Below are the step taken to troubleshoot this issue. 1- Swapped CPU 1 to CPU2 2 - Update BIOS...
[15:34:46] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2725676 (chasemp)
[15:35:23] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725681 (Marostegui) MySQL is back up and replicating Thanks @Papaul
[15:41:16] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2725688 (Marostegui) I have finished importing S1's enwiki tablespace. There are some things that need to be mentioned. The following command: ``` for i in `mysql --skip-ssl enwiki -e "show tables;" -B`; do... 
[16:09:51] DBA, Operations, ops-codfw, Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2725772 (Cmjohnson) Thanks @Papaul, let me know if the error returns and where.
[16:16:31] DBA, Operations, ops-codfw: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2725805 (Papaul) a:Papaul
[16:28:11] DBA, Operations, ops-codfw: es2014 raid controler temporary failure - https://phabricator.wikimedia.org/T148434#2722693 (Papaul) @jcrespo there is not enough information to troubleshoot or creating a case with the information provide.
[16:40:29] DBA, Operations, ops-codfw: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2725880 (Papaul) Will Have the replacement disk on site tomorrow. Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms you... 
[16:42:07] DBA, Operations, ops-codfw: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2725894 (jcrespo)
[16:42:09] DBA, Operations, ops-codfw: es2014 raid controler temporary failure - https://phabricator.wikimedia.org/T148434#2725892 (jcrespo) Open>Resolved a:jcrespo
[17:39:00] DBA, Labs, Labs-Infrastructure, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (chasemp)
[17:40:52] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2726240 (chasemp)
[17:40:56] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#609458 (chasemp)
[17:40:57] DBA, Labs, Operations, Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2726242 (chasemp)
[17:41:01] DBA, Labs, Labs-Infrastructure: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2400728 (chasemp) Open>Resolved I'm resolving this as I believe {T148560}, {T147302} and {T59617} are now the relevant work tasks. 
Big thanks to @Krenair
[17:45:05] DBA, Labs, Labs-Infrastructure, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service: Replicate ores_classification and ores_model on labs - https://phabricator.wikimedia.org/T148561#2726255 (Ladsgroup)
[17:45:29] DBA, Labs, Labs-Infrastructure, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service: Replicate ores_classification and ores_model tables in labs - https://phabricator.wikimedia.org/T148561#2726270 (Ladsgroup)
[17:52:49] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2726306 (jcrespo) Note- pt-table checksum is ok (although it may need a patch to work) but pt-table-sync corrupts data with our config, make sure you do not use it.
[17:57:31] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2726317 (Marostegui) Thanks for the heads up. At the moment I am planning only to use pt-table-checksum but it is good (and quite important) to know about pt-table-sync. Much appreciated!
[18:40:26] DBA, Labs, Labs-Infrastructure, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (AlexMonk-WMF) (a script run would also handle wikimania2017wiki_p and tcywiki_p which are currently missing)
[18:40:41] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2726519 (chasemp) I think with T148560 this is acheivable
[18:42:04] DBA, Labs, Labs-Infrastructure, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (chasemp) >>! In T148560#2726515, @AlexMonk-WMF wrote: > (a script run would also handle wikimania2017wiki_p and tcywiki_p which are currently m... 
[19:15:31] DBA, Labs, Labs-Infrastructure, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726654 (chasemp) a:chasemp>None
[19:16:05] DBA, Labs, Labs-Infrastructure, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2726213 (chasemp) I'm not sure what the right steps are to do this through Puppet so I'm hoping to connect with one of the #DBA folks to knock it out so...
[20:00:36] manuel, when you read this, db1053 and db1064 should be healthy and up to date with replication
[20:00:57] however, I have not yet applied the partitioning to db1053
[20:01:14] so it cannot be set as rc role yet
[20:02:00] I have not pooled back db1064, because it will need some time, but you could do that (vslow, dump) by tomorrow
[20:02:33] keep db1053 depooled and I will apply the partitioning on thursday
[20:02:50] or, alternatively, just keep things as they are now
[20:03:30] I will have to rebase this https://gerrit.wikimedia.org/r/315975 to the latest version
[20:49:09] DBA, RESTBase-Cassandra, Patch-For-Review, Services (doing): Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2726997 (Pchelolo) All restrictions and deletions has be imported.
[21:29:03] DBA, RESTBase-Cassandra, Patch-For-Review, Services (doing): Import page restrictions to Cassandra restriction table - https://phabricator.wikimedia.org/T135278#2293930 (Pchelolo) Open>Resolved a:Pchelolo The rest of the work will be tracked under T148592, resolving