15:31:21 #startmeeting Pulp Triage 2020-12-01
15:31:21 #info fao89 has joined triage
15:31:21 !start
15:31:21 Meeting started Tue Dec 1 15:31:21 2020 UTC. The chair is fao89. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:31:21 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:31:21 The meeting name has been set to 'pulp_triage_2020-12-01'
15:31:21 fao89: fao89 has joined triage
15:31:28 Open floor!
15:31:46 #info bmbouter has joined triage
15:31:46 !here
15:31:48 bmbouter: bmbouter has joined triage
15:31:50 #info ttereshc has joined triage
15:31:50 !here
15:31:50 ttereshc: ttereshc has joined triage
15:31:56 no items on the agenda so far
15:32:08 #info daviddavis has joined triage
15:32:08 !here
15:32:08 daviddavis: daviddavis has joined triage
15:32:31 should we start triage?
15:32:51 I think so
15:33:23 !next
15:33:24 #topic https://pulp.plan.io/issues/7909
15:33:24 fao89: 7 issues left to triage: 7909, 7908, 7907, 7904, 7876, 7857, 5502
15:33:25 RM 7909 - ipanova@redhat.com - POST - Downloader map from repair feature contains only core downloaders
15:33:26 https://pulp.plan.io/issues/7909
15:33:32 #info ipanova has joined triage
15:33:32 !here
15:33:32 ipanova: ipanova has joined triage
15:33:49 #idea Proposed for #7909: accept and add to sprint
15:33:49 !propose other accept and add to sprint
15:33:49 fao89: Proposed for #7909: accept and add to sprint
15:34:00 +1
15:34:09 +1
15:34:19 #agreed accept and add to sprint
15:34:19 !accept
15:34:19 fao89: Current proposal accepted: accept and add to sprint
15:34:20 #topic https://pulp.plan.io/issues/7908
15:34:20 fao89: 6 issues left to triage: 7908, 7907, 7904, 7876, 7857, 5502
15:34:21 RM 7908 - ggainey - NEW - Make sure all exceptions live in pulpcore.plugin.exceptions
15:34:22 https://pulp.plan.io/issues/7908
15:34:33 #info ggainey has joined triage
15:34:33 !here
15:34:33 ggainey: ggainey has joined triage
15:34:48 #info mikedep333 has joined triage
15:34:48 !here
15:34:49 mikedep333: mikedep333 has joined triage
15:35:03 story?
15:35:05 why is deprecation needed?
15:35:05 7908 is to clean up some misplaced exceptions - it'll require a deprecation cycle to complete cleanly
15:35:25 ttereshc: because we're moving an exception, anything using it where it is now would break
15:35:46 fao89: maybe task?
15:35:48 plugins are not allowed to use anything outside the plugin api
15:35:56 is it moving within the plugin api?
15:36:02 #idea Proposed for #7908: convert to task
15:36:02 !propose other convert to task
15:36:02 fao89: Proposed for #7908: convert to task
15:37:05 ttereshc: that's a good point, I think the one exception I *know* is involved should be in plugins but isn't currently
15:37:13 I see at least one UnsupportedDigestValidationError
15:37:21 it's in pulpcore.plugin.models
15:37:30 ah right, ok
15:37:34 daviddavis++
15:37:34 ggainey: daviddavis's karma is now 401
15:37:34 I'm all for moving it into one place
15:37:39 just without deprecation
15:38:20 ttereshc: daviddavis put his finger on the problem
15:39:03 what do you mean by that?
15:39:56 ttereshc: the current location is exposed to plugins in pulpcore.plugin.models, and it needs to move from there
15:40:07 ah ok
15:40:16 so like I'm imagining we add the exception(s) to their final place in 3.9 and then remove them from their old place in 3.10
15:40:35 daviddavis: yupyup, sounds right
15:40:46 yeah
15:40:48 would you be able to do this this week? https://pulp.plan.io/issues/7908
15:41:04 daviddavis: yup, it's on my list for 'asap'
15:41:11 ggainey++
15:41:11 daviddavis: ggainey's karma is now 64
15:41:16 precisely so it can get into 3.9
15:41:26 !propose convert to task and add to sprint
15:41:26 daviddavis: Error: "propose" is not a valid command.
15:41:32 I'll add the 3.9 milestone
15:41:36 thanks
15:41:56 why not as the pr introducing the new exception?
15:41:57 #idea Proposed for #7908: convert to task and add to sprint
15:41:57 !propose other convert to task and add to sprint
15:41:57 fao89: Proposed for #7908: convert to task and add to sprint
15:42:20 concur
15:42:22 or maybe this is to move the other ones?
15:42:43 bmbouter: the other ones as well
15:42:43 or maybe the value of this other issue is to have a plugin writer release note introducing it?
15:42:58 that makes sense, ty
15:43:35 #agreed convert to task and add to sprint
15:43:35 !accept
15:43:35 fao89: Current proposal accepted: convert to task and add to sprint
15:43:36 #topic https://pulp.plan.io/issues/7907
15:43:37 fao89: 5 issues left to triage: 7907, 7904, 7876, 7857, 5502
15:43:38 RM 7907 - osapryki - NEW - Failed curate_synclist_repository task did not clean up properly resource reservations
15:43:39 https://pulp.plan.io/issues/7907
15:44:26 oh yeah I got contacted on slack this morning about this one
15:44:43 ow, painful
15:44:56 (the issue, not necessarily the slack contact :) )
15:44:59 accept or accept and add?
15:45:42 I lean towards accept and add. maybe even high severity?
15:45:49 this puts the system in an unusable state
15:45:53 agreed, but the issue is what to dooooo
15:46:04 #idea Proposed for #7907: accept and add to sprint
15:46:04 !propose other accept and add to sprint
15:46:04 fao89: Proposed for #7907: accept and add to sprint
15:46:05 yeah
15:46:07 we know that redis queues are unstable
15:46:08 #info dkliban has joined triage
15:46:08 !here
15:46:08 dkliban: dkliban has joined triage
15:46:18 this is not new information
15:46:34 apparently this happens only when there is 1 worker
15:46:45 any cleanup we can do on task failure? or does it happen at another point?
15:46:55 we know that RQ doesn't support clustered redis to guard against redis failure
15:47:03 should we make workers=2 the default?
15:47:05 the issue is the task has not failed...
15:47:13 worker count is not part of the issue I believe
15:47:20 I guess the full blockage is easier to reproduce with one worker
15:47:29 but eventually all can be blocked probably
15:47:31 true
15:48:07 so here's the issue ... redis fails and the task is lost from redis, the worker stays running and heartbeating so pulp can never know that there was a problem
15:48:13 the heartbeats flow to postgresql
15:48:48 this is one main motivation why I think we should eliminate the resource manager and move to tasking recorded entirely in postgresql
15:49:11 because our correctness is "split" across two systems, and note RQ doesn't support clustered redis so redis is a single point of failure
15:49:29 so one of those systems is unreliable and that is beyond our control (or influence)
15:49:50 makes sense
15:50:00 fwiw I think this is a dupe of https://pulp.plan.io/issues/7386
15:50:42 so this is triage and we can move on but the thing is we can't just "put it on the sprint without more action"
15:50:57 can we periodically check with redis if the task is there and act accordingly? It's not the solution but maybe a less invasive workaround while we have a resource_manager.
15:51:46 we could probably have the resource manager do the checking
15:51:57 but then we'll get another bug I suspect: "pulp mysteriously cancels my tasks"
15:52:23 heh, yeah
15:52:33 the args and kwargs live in the redis task alone so we can't redispatch
15:52:37 let's just triage it for now, and have more discussion on the issue?
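[Editor's note] The less invasive workaround floated here (periodically asking Redis whether each dispatched task still exists, and reaping the ones that vanished) could look roughly like the sketch below. Everything in it is hypothetical scaffolding: in a real deployment `job_exists` would query Redis/RQ for the job payload and `mark_failed` would update the Task record in postgresql.

```python
def find_lost_tasks(running_task_ids, job_exists):
    """Return ids of tasks that the database believes are running or
    waiting but whose payload has vanished from the broker (e.g. after
    a Redis restart).

    running_task_ids -- ids recorded as in-flight in the database.
    job_exists -- callable(task_id) -> bool; stands in for a Redis/RQ
    lookup of the job payload.
    """
    return [tid for tid in running_task_ids if not job_exists(tid)]


def reap_lost_tasks(running_task_ids, job_exists, mark_failed):
    """Mark vanished tasks as failed so the system does not block forever.

    As noted in the meeting, the task's args and kwargs lived only in
    the Redis payload, so a lost task cannot be re-dispatched; failing
    it loudly is the best this workaround can do.
    """
    lost = find_lost_tasks(running_task_ids, job_exists)
    for tid in lost:
        mark_failed(tid, reason="task payload missing from Redis")
    return lost
```

This also illustrates the "pulp mysteriously cancels my tasks" concern: the reaper cannot distinguish a genuinely lost payload from a transiently unreachable Redis, so a cautious implementation would require several consecutive failed lookups before reaping.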
15:52:54 sounds fair
15:52:54 the problem is we're getting "stern talkings to" about this
15:53:05 bmbouter, an alternative is to have the heartbeats recorded in redis
15:53:25 that's possible, but I think that further splits our system though
15:53:40 indeed
15:53:51 I can schedule a 30 min call to discuss if folks would want to join
15:54:00 or discuss on the issue
15:54:21 the only viable road I see is resource manager elimination and removal of redis from the architecture
15:54:30 I don't have any objection to a queue design based on postgresql, we don't have high throughput requirements (where high is measured in terms of tasks/second)
15:54:31 I can sufficiently motivate this on the issue, I guess that would be a good next step
15:55:00 let me post on the issue and send out to pulp-dev for feedback, how does that sound?
15:55:06 +1
15:55:10 I'm also not against removing the resource manager
15:55:16 +1 thank you bmbouter
15:55:22 +1
15:55:31 I was more thinking about what can be done if we need to fix it asap
15:55:31 yeah I perceive that, the proposal of a less invasive workaround is a good one
15:55:53 +1 to also having a call to talk about the resource manager elimination
15:56:06 I will include the workaround in my post also, perhaps we'll do both
15:56:20 and schedule a call following the pulp-dev posting
15:56:31 ty
15:56:39 which action should I take? accept? skip? accept and add?
15:56:48 accept I prio and I'll post on it
15:57:02 "accept w/ high prio" that should have said...
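[Editor's note] The "queue design based on postgresql" direction discussed here stores the task payload (name, args, kwargs) in the same database as the task state, so a broker crash can no longer lose it. A minimal sketch of that idea, with sqlite3 standing in for PostgreSQL and all table/function names invented for illustration; a real multi-worker implementation on PostgreSQL would claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` instead of the optimistic `UPDATE` used here:

```python
import json
import sqlite3

# Single table holds both the task payload and its state, so nothing is
# lost if any other process dies: the args live next to the state.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " args TEXT NOT NULL,"
    " state TEXT NOT NULL DEFAULT 'waiting')"
)


def dispatch(name, args):
    """Record a task; returns its id. The payload is durable in the DB."""
    cur = conn.execute(
        "INSERT INTO tasks (name, args) VALUES (?, ?)", (name, json.dumps(args))
    )
    conn.commit()
    return cur.lastrowid


def claim_next():
    """Claim the oldest waiting task, flipping it to 'running' atomically.

    Returns a dict for the claimed task, or None if the queue is empty
    (or another worker won the race, in which case the caller retries).
    """
    row = conn.execute(
        "SELECT id, name, args FROM tasks WHERE state = 'waiting' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    updated = conn.execute(
        "UPDATE tasks SET state = 'running' WHERE id = ? AND state = 'waiting'",
        (row[0],),
    ).rowcount
    conn.commit()
    if updated:
        return {"id": row[0], "name": row[1], "args": json.loads(row[2])}
    return None  # lost the race; caller polls again
```

Because the payload survives in the database, a task found orphaned (e.g. its worker's heartbeat went stale) could be flipped back to 'waiting' and re-dispatched, which is exactly what the Redis-only payload in the current design prevents.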
15:57:05 minorly related to https://pulp.plan.io/issues/4343 also perhaps
15:57:18 #idea Proposed for #7907: accept with high prio
15:57:18 !propose other accept with high prio
15:57:18 fao89: Proposed for #7907: accept with high prio
15:57:25 #agreed accept with high prio
15:57:25 !accept
15:57:25 fao89: Current proposal accepted: accept with high prio
15:57:26 #topic https://pulp.plan.io/issues/7904
15:57:27 fao89: 4 issues left to triage: 7904, 7876, 7857, 5502
15:57:28 RM 7904 - ggainey - NEW - PulpImport can deadlock when importing Centos*-base and app-stream in one import file
15:57:29 https://pulp.plan.io/issues/7904
15:57:47 accept and add to sprint I think
15:57:51 I hit this yesterday - opened the issue to record the scenario and failure output
15:57:58 aye
15:58:12 #idea Proposed for #7904: accept and add to sprint
15:58:12 !propose other accept and add to sprint
15:58:12 fao89: Proposed for #7904: accept and add to sprint
15:58:24 #agreed accept and add to sprint
15:58:24 !accept
15:58:24 fao89: Current proposal accepted: accept and add to sprint
15:58:25 #topic https://pulp.plan.io/issues/7876
15:58:25 fao89: 3 issues left to triage: 7876, 7857, 5502
15:58:26 RM 7876 - adam.winberg@smhi.se - NEW - NoneType' object has no attribute 'pk'
15:58:27 https://pulp.plan.io/issues/7876
15:59:11 we discussed this one last week at the pulpcore meeting and agreed to fix in pulpcore
15:59:33 the issue can be reproduced using the migration plugin
15:59:34 #idea Proposed for #7876: Leave the issue as-is, accepting its current state.
15:59:34 !propose accept
15:59:34 fao89: Proposed for #7876: Leave the issue as-is, accepting its current state.
15:59:43 +1
15:59:47 +1
15:59:56 ttereshc: add to sprint?
16:00:49 +1 not against adding to sprint
16:00:59 can be
16:01:04 the fix should be small and easy if I remember correctly
16:01:16 it's not the highest priority but it needs to be fixed
16:01:22 #idea Proposed for #7876: accept and add to sprint
16:01:22 !propose other accept and add to sprint
16:01:22 fao89: Proposed for #7876: accept and add to sprint
16:01:26 #agreed accept and add to sprint
16:01:26 !accept
16:01:27 fao89: Current proposal accepted: accept and add to sprint
16:01:27 #topic https://pulp.plan.io/issues/7857
16:01:28 fao89: 2 issues left to triage: 7857, 5502
16:01:29 RM 7857 - jsherril@redhat.com - NEW - 502 proxy error when trying to yum install a package
16:01:30 https://pulp.plan.io/issues/7857
16:03:14 not a lot to go on in the issue
16:03:37 we skipped it last time, but I don't remember if we had an AI
16:04:57 skip again?
16:06:03 could this be related to the django cleanup integration we had?
16:06:06 we can close it asking to provide more info to reproduce if someone encounters it again
16:06:20 dkliban, it was introduced in 3.7
16:06:23 gotcha
16:06:24 we ruled it out last time
16:06:45 yeah ... let's close it out for now
16:07:04 fao89, I can do it if you want
16:07:20 please, do it
16:07:24 I found in the logs that we meant to ping jsherrill and ask for more info about his setup
16:07:49 we can close and still do it I think
16:07:50 #idea Proposed for #7857: close and ask for more info
16:07:50 !propose other close and ask for more info
16:07:50 fao89: Proposed for #7857: close and ask for more info
16:07:59 ok
16:08:26 #agreed close and ask for more info
16:08:26 !accept
16:08:26 fao89: Current proposal accepted: close and ask for more info
16:08:27 fao89: 1 issue left to triage: 5502
16:08:27 #topic https://pulp.plan.io/issues/5502
16:08:28 RM 5502 - mihai.ibanescu@gmail.com - ASSIGNED - worker fails to mark a task as failed when critical conditions are encountered
16:08:29 https://pulp.plan.io/issues/5502
16:09:10 bmbouter: ^
16:09:17 this already has a pr
16:09:50 it just needs an updated commit message
16:09:56 let's triage it and put it on the sprint
16:10:08 * bmbouter looks
16:10:23 #idea Proposed for #5502: accept and add to sprint
16:10:23 !propose other accept and add to sprint
16:10:23 fao89: Proposed for #5502: accept and add to sprint
16:10:29 yes this PR is open
16:10:52 +1
16:10:53 #agreed accept and add to sprint
16:10:53 !accept
16:10:53 fao89: Current proposal accepted: accept and add to sprint
16:10:54 fao89: No issues to triage.
16:10:54 I need to add a changelog to it
16:10:57 +1
16:11:11 #endmeeting