[06:27:41] 10DBA, 06Labs, 10Labs-Infrastructure, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687248 (10Marostegui) Makes sense - I will closed it as resolved and will report back if Percona/MariaDB report some findings.
[06:27:53] 10DBA, 06Labs, 10Labs-Infrastructure, 07Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687249 (10Marostegui) 05Open>03Resolved
[06:36:02] 07Blocked-on-schema-change, 10DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2687265 (10Marostegui) a:03Marostegui
[06:44:23] 10DBA, 06Labs, 10Labs-Infrastructure, 10MassMessage, and 2 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687274 (10Legoktm) The queries listed in the bug so far all seem to occur for the MassMessage system user "MediaWiki message delivery" (by lookin...
[06:48:51] marostegui: ^
[06:49:16] 10DBA, 06Labs, 10Labs-Infrastructure, 10MassMessage, and 2 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2687278 (10Marostegui) Thanks a lot @Legoktm for spending time on this. Hopefully by changing the code as you just did and converting them to Inn...
[06:49:24] legoktm: good timing, just commented :)
[06:49:58] legoktm: The thing is…we still don't know how can that cause the server to stall and if there will be more cases :_(
[06:50:05] (with other queries)
[06:50:05] heh :)
[06:50:20] yeah...that's totally out of my knowledge
[06:50:28] And ours :-)
[06:50:41] And looks like out from MariaDB and Percona too, as they are kinda blaming each other now
[06:51:19] hah
[07:21:37] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687311 (10Marostegui) This table has been dropped for good in both DCs.
[07:21:53] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687314 (10Marostegui)
[07:53:38] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687359 (10Marostegui) TO_DROP_hitcounter table has been deleted from both DCs, I believe this ticket can be closed now.
[07:59:31] jynus: Were you able to talk to MariaDB/Monty in the end yesterday?
[07:59:35] Just curious
[07:59:59] I briefly spoke at someone at the mariadb foundation
[08:03:17] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687376 (10jcrespo) ``` MariaDB MARIADB db1083 enwiki > SHOW TABLES like '\_counter%';; +-------------------------------+ | Tables_in_enwiki (\_counter%) | +-----------------------------...
[08:05:05] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687379 (10Marostegui) Shall I drop it also from S1 then? The original post said S3 wikis, but I can research if it can be dropped from everywhere.
[08:08:36] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687380 (10jcrespo) I wrote that, and I meant that I had to recreate them on s3 because I restarted one of those, creating issues there. If a table is dropped from production, we should...
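Editor's note: the truncated paste at 08:03:17 above is the check for leftover counter tables on enwiki (db1083). A minimal sketch of that kind of check follows; the information_schema variant is an illustrative addition and not taken from the log.

```sql
-- Check one wiki database for leftover counter tables before dropping them.
-- The backslash makes LIKE match a literal underscore instead of treating
-- '_' as a single-character wildcard.
USE enwiki;
SHOW TABLES LIKE '\_counter%';
SHOW TABLES LIKE 'hitcounter';

-- The same check across every schema on the host, via information_schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_name LIKE '\_counter%'
   OR table_name = 'hitcounter';
```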
[08:11:39] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687393 (10Marostegui) Makes sense - I will review that, rename and if doesn't cause issues drop them.
[08:12:22] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687394 (10Marostegui) This table exists in S1 - check and drop if possible.
[08:12:34] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687399 (10Marostegui)
[08:12:36] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687398 (10Marostegui) 05Resolved>03Open
[08:36:27] hi! not sure if you've seen it but db1081 / db1084 still saturate their ethernet ports, expected?
[08:38:47] godog: I can check, although I am probably missing context that jynus might have :-)
[08:40:50] godog: so it spikes from time to time reaching 1Gbps as I see in the graphs?
[08:41:23] marostegui: yeah, I noticed it from the emails I'm getting from librenms, we alert on high port utilization
[08:41:32] most times it is nothing and/or expected though
[08:44:39] Those times were there are network spikes matches when there are spikes of DELETEs and UPDATES
[08:44:53] Maybe jobs running at that time?
[08:46:17] marostegui: I think the cause is another, it's outbound traffic, not inbound
[08:48:18] I can see slow queries too at those times
[08:48:28] (I'm looking at db1081
[08:51:29] In db1084 pt-kill killed some queries at around that time
[08:52:18] ok, so from what I'm seeing I see a lot of SELECT img_metadata FROM `image`
[08:52:21] from ForeignDBFile::loadExtraFromDB
[08:52:30] the img_metadata field is a blob
[08:52:42] :(
[08:52:44] it might justify the high output size
[08:54:37] all in commons ofc
[08:55:26] ah, mhh it might be related to WLM as it just finished I think
[08:55:37] WLM?
[08:57:46] the strange thing is that those queries are querying by primary key (img_name)... so shouldn't be slow :(
[08:58:31] marostegui: wiki loves monuments, runs in september
[09:00:44] Ah, thanks :)
[09:01:43] np, yearly reoccurence where there's a spike of uploads to swift :))
[09:01:47] so my question is what is supposed to be put in the img_metadata?
[09:02:11] from one random from the slow queries length(img_metadata) is 11MB
[09:04:28] mmmh I might have found it...
[09:04:32] they are all .pdf
[09:05:05] actually it might be the same all the time
[09:05:50] mostly the same, but there are others too
[09:05:59] so I guess the PDF is stored in the metadata? :(
[09:17:03] so we have the explanation of the traffic peaks
[09:18:33] now we need to know if it's normal that kind of content and why was queried so many times in a short timeframe
[09:21:39] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687625 (10Marostegui) So `_counters`still exist at: S1 enwiki S2 bgwiki bgwiktionary cswiki enwikiquote enwiktionary eowiki fiwiki idwiki itwiki nlwiki nowiki plwiki ptwiki svwiki thw...
[09:22:51] I found this https://phabricator.wikimedia.org/T96360
[09:22:51] 10DBA: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487#2687629 (10Marostegui) So `_counters` and `hitcounter` still exist at: S1 enwiki S2 bgwiki bgwiktionary cswiki enwikiquote enwiktionary eowiki fiwiki idwiki itwiki nlwiki nowiki plwiki ptwiki svwiki thwiki trwiki zh...
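Editor's note: to put numbers on the blob sizes being discussed, a check along the following lines would do it. This is a sketch only: the placeholder filename and the mime filter are illustrative, and the second query scans every PDF row on commonswiki, so it belongs on a depooled or low-traffic replica rather than a pooled one.

```sql
-- On commonswiki: measure the metadata blob of one file seen in the slow
-- query log (placeholder name), then list the largest PDF metadata blobs.
-- The production lookups themselves are fast (img_name is the primary key);
-- the problem is the volume of bytes shipped back for rows like these.
SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE img_name = 'Example_document.pdf';        -- placeholder filename

SELECT img_name, LENGTH(img_metadata) AS metadata_bytes
FROM image
WHERE img_major_mime = 'application'
  AND img_minor_mime = 'pdf'
ORDER BY metadata_bytes DESC
LIMIT 20;
```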
[09:23:01] T96360
[09:23:03] T96360: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360
[09:23:07] godog , marostegui ^^^
[09:23:25] wow,. that is old
[09:24:17] I can confirm many are from bots
[09:26:46] * volans brb
[09:28:26] interesting, the task above seems to cover djvu but not pdf
[09:36:30] I don't see a PDF handler here https://github.com/wikimedia/mediawiki/tree/master/includes/media
[09:40:48] volans, is that happening again?
[09:41:55] jynus: it happened twice tonight and another time yesterday as I can tell from ganglia/tendril
[09:42:49] that was probable me doing a schema change?
[09:43:00] on depooled slaves
[09:43:30] jynus: not, you lost some history I can see you disconnected and reconnected
[09:43:46] sorry
[09:43:51] 10DBA: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837#2687676 (10Marostegui) Using a slave with not much traffic of each shard I have renamed both tables to make sure there are no replication issues. If in 48h all looks fine, I will go ahea...
[09:43:57] so basically most of them are querying img_metadata from image on commonswiki on a PDF document that has a ~13MB metadata field
[09:44:19] mostly from bots, there were like 12k select on those, that justify the network saturation in outbound
[09:44:36] I've found that is the same behaviour of T96360
[09:44:37] T96360: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360
[09:44:46] just for PDF instead of Djvu
[09:45:18] is it related to refreshlinks, or regular queries?
[09:45:31] sorry, I with regular I mean webrequest from users
[09:45:31] ForeignDBFile::loadExtraFromDB
[09:45:38] mostly web bots
[09:45:46] google, msn...
[09:45:48] ok, thanks
[09:45:52] but not only
[09:46:05] maybe we should file a new ticket and reference that one
[09:46:09] see tendril slow queries for the last 10h on db1081 for example
[09:46:13] I think so, I can do it
[09:46:42] note that even if it wasn't the schema change
[09:46:49] we were on reduced load
[09:46:58] 2 out of 3 servers for most of yesterday
[09:47:06] that could contribute to it
[09:47:25] sure, although 12k select of 13MB is ~156GB ;)
[09:47:26] we can also pool more servers as a temporary measure
[09:47:42] well, I am not sayin that as a fix
[09:47:53] but as a measure to not saturate the link until a fix is done
[09:47:57] ofc it helps to spread on more 1Gb cables ;)
[09:48:03] yep
[09:48:04] yeah
[09:48:14] let me open the task
[09:48:15] how did you got that?
[09:48:26] were you monitoring ethernet stats?
[09:48:30] godog see the librenms alarm on port utilization
[09:48:33] *saw
[09:48:36] and asked here
[09:48:37] thanks
[09:49:28] maybe you can add a mail rule to put in your inbox the db ones?
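Editor's note: the rename-before-drop step described in the 09:43:51 comment follows the TO_DROP_ naming convention visible elsewhere in this log (TO_DROP_hitcounter, TO_DROP_email_capture). A minimal sketch of the pattern; the renamed name for `_counters` and the exact table set are illustrative.

```sql
-- Step 1: rename instead of dropping, so anything still reading or writing
-- the table fails visibly, and the data can be restored with another RENAME.
RENAME TABLE hitcounter TO TO_DROP_hitcounter;
RENAME TABLE `_counters` TO `TO_DROP__counters`;   -- renamed name is illustrative

-- Step 2: after the observation window (48h in the comment above), if the
-- application and replication stayed healthy, drop for good.
DROP TABLE TO_DROP_hitcounter;
DROP TABLE `TO_DROP__counters`;
```

RENAME TABLE is a fast metadata-only operation, which is why trying it first on a low-traffic replica, as described in the comment, is a cheap way to spot replication or application breakage before touching the rest of the shard.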
[09:51:54] I cannot even find that alert
[09:52:12] godog ^^^ :)
[09:54:26] I can see several alerts, but router ones, not server-specific
[09:54:35] well, not routers, switches
[09:54:56] I guess is that, port utilization checks are on the switches IIRC
[09:55:37] I see it now, the email has as subject "Alert for device asw-a-eqiad.mgmt.eqiad.wmnet - Port utilisation over threshold"
[09:55:49] and the host in the body only
[09:56:27] yup, that
[09:56:40] it doesn't happen often though, mostly for analytics hosts
[10:04:49] I checked stats and it is indeed an issue, but I think not an UBN one
[10:05:54] I'm putting it high
[10:06:01] I will continue monitoring: https://grafana-admin.wikimedia.org/dashboard/db/server-board?panelId=10&fullscreen&from=1475143509645&to=1475575509645&var-server=db1081&var-network=eth0
[10:06:06] just double checking the content is a serialized array
[10:12:58] 10DBA, 10MediaWiki-General-or-Unknown, 06Operations: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2687819 (10Volans)
[10:13:36] jynus, marostegui ^^^
[10:13:42] all yours :)
[10:14:56] volans: Thanks, now we can really track it :-)
[10:16:06] jynus: adds additional tags if you think they are needed, the old one was having also availability and multimedia
[10:20:26] well, while it is causing operations issues, I do not think it should be handled at that layer, aside from mitigation
[10:50:30] qq - I installed a new memcached on mc2009 in codfw and I played with its new logging capabilities. This is what I see if I type "watch fetchers":
[10:50:33] ts=1475577987.200351 gid=580 type=item_get key=global%3AChronologyProtector%3Aaede7f15780989d0f428135c3cf94839 status=not_found
[10:51:11] the only ref to "chronology protector" that I found was in https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage :D
[10:51:45] do you guys have more info about it?
[12:15:52] 10DBA, 06Labs, 10Labs-Infrastructure: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688126 (10Dzahn)
[12:16:47] 10DBA, 06Labs, 10Labs-Infrastructure: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688140 (10Dzahn)
[12:17:22] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688126 (10Dzahn)
[12:41:06] 10DBA, 06Operations: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2688222 (10Marostegui) This table has been renamed to `TO_DROP_email_capture` across eqiad in the following wikis. S1: enwiki S3: testwiki
[13:01:02] 10DBA: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2688288 (10Marostegui)
[13:28:50] 10DBA, 06Operations, 06Performance-Team, 07Availability, 07Wikimedia-Multiple-active-datacenters: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2688443 (10BBlack)
[13:37:10] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688485 (10jcrespo) This is not a blocking step at this point- the process can continue but this must be kept open until the production side of filtering is run.
[13:48:42] 10DBA, 06Operations, 13Patch-For-Review: db1019: Decommission - https://phabricator.wikimedia.org/T146265#2688549 (10Marostegui) I have created the HW request to get it removed: https://phabricator.wikimedia.org/T147309 Also @Volans and myself were unsure about whether it can be safely remove from the array...
[13:52:34] 10DBA: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2688571 (10Marostegui)
[14:08:46] 10DBA, 06Operations, 10ops-eqiad: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2683544 (10Cmjohnson) Replaced disk 0 on db1055...rebuilding Adapter #0 Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 0, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 0 WWN: 5000C50...
[14:09:52] 10DBA, 06Operations, 10ops-eqiad: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2688636 (10Marostegui) Thanks Chris - will check in a few hours and will close this as resolved once it has finished. Manuel
[14:48:07] marostegui, remember to check the physical location if possible when moving shards of a server around
[14:48:15] jynus: Some handover I have collected today so you can get a full picture of the day: https://phabricator.wikimedia.org/P4153
[14:48:27] oh, you are here
[14:48:50] jynus: what do you mean? you are talking about the ticket to replace db1019 with another one for S4 rc service?
[14:49:02] yes
[14:49:33] ok :)
[14:50:53] good news for dbstores, it seems
[14:51:00] yeah :)
[14:51:04] good
[14:51:19] marostegui: FYI our failure "units" are rack, row, datacenter apart the single server ofc
[14:51:38] volans: you lost me
[14:52:10] about the physical location, if you want to spread for redundancy
[14:52:18] aaah gotcha
[14:52:33] :)
[14:53:26] By the way, I should be able to access racktables.wikimedia.org with my ldap account, right?
[14:53:30] Or do I need a separate one?
[14:53:35] nope, separate account
[14:53:38] Ah
[14:53:38] unfortunately
[14:53:45] ping rob ;)
[14:53:49] Will do :)
[14:53:50] Thanks!
[14:53:54] or hackit yourself, as I did
[14:53:59] rotfl
[14:54:02] XDDDDDD
[14:54:06] actually I think anyone can do it
[14:54:12] let me check
[14:54:46] grazzie
[15:07:54] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2688832 (10Dzahn) Thank you, i will go ahead with adding it to DNS.
[15:23:39] 10DBA, 06Operations, 10ops-codfw: db2017 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T145844#2688904 (10Papaul) a:03Marostegui Disk replacement complete
[18:38:05] got a question on two tables in re: to the labsdb stuff
[18:38:16] I'm looking at our candidate for view maintenance and I see it adds to views
[18:38:20] two views even :)
[18:38:26] +abuse_filter_history
[18:38:26] +watchlist_count
[18:38:57] so interestingly these views exist in the old perl version but are not actually exposed on the labsdb _p's I spotchecked
[18:39:11] I'm wondering if both were removed at some point as bad news but the maint script wasn't fixed
[18:39:25] and that would mean whatever two new wiki's were done this year probably have them defined
[18:39:45] but I'm not sure atm what exactly to think, both of those seems like decent candidates for not exposing to my naive eye
[18:45:38] I also don't see abuse_filter_history here https://github.com/wikimedia/operations-software-labsdb-auditor/blob/master/greylisted.yaml#L821
[19:12:17] chasemp: for the abuse filter one... It looks like rows could be exposed based on some flags... Which usually means it's easier to not expose
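Editor's note: for context on why a view like abuse_filter_history is "easier to not expose" — the labsdb replica views are plain SQL views that either expose a production table wholesale or filter out rows and columns based on visibility flags. The sketch below is purely hypothetical: the column names, the flag condition, and the column list are invented for illustration and are not taken from the real maintain-views definitions.

```sql
-- Hypothetical shape of a conditional labsdb view: rows are only exposed
-- when a visibility flag says they are public. Getting such a condition
-- wrong leaks suppressed data, which is why not defining the view at all
-- is the safer default.
CREATE OR REPLACE VIEW enwiki_p.abuse_filter_history AS
SELECT afh_id, afh_filter, afh_timestamp        -- column list is illustrative
FROM enwiki.abuse_filter_history
WHERE afh_deleted = 0;                          -- hypothetical visibility flag
```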
[19:12:30] No idea on watchlist_count
[19:13:05] can't see any sign of it our code
[19:17:03] thanks Reedy, I dug up https://phabricator.wikimedia.org/T78730#947664 which doesn't equal any specific outcome but I'm guessing someone made the same call at ...some point
[19:18:19] Reedy Krenair dug up https://phabricator.wikimedia.org/T123895
[20:52:07] 10DBA, 06Labs, 10Tool-Labs: s51127__dewiki_lists (merlbot) database using 13G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133325#2690378 (10bd808)
[21:24:37] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations: Prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2690580 (10MarcoAurelio) p:05Triage>03Normal
[21:43:09] 10DBA, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 07FR-2016-17-Q2-Campaign-Support, and 2 others: Spike: Look into transaction isolation level and other tricks for easing db contention - https://phabricator.wikimedia.org/T146821#2690701 (10DStrine)
[22:15:43] 10DBA, 06Collaboration-Team-Triage, 06Community-Tech-Tool-Labs, 10Flow, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2690863 (10Dereckson) We need `$wgFlowDefaultWikiDb` to be set before to be able to crea...
[22:35:22] 10DBA, 06Collaboration-Team-Triage, 06Community-Tech-Tool-Labs, 10Flow, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2690926 (10Mattflaschen-WMF) a:03Dereckson