
Commit f31d92c

Source Zendesk Support: refactor and debug stuck syncs (#20900)

* #1231 source zendesk support: remove synchronous sleep, add logging, reduce backoff time
* #1231 source zendesk support: update changelog
* #1231 source zendesk: reduce sleeps
* auto-bump connector version

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>

1 parent 5b731d1 commit f31d92c

File tree

9 files changed: +259 −138 lines changed

airbyte-config/init/src/main/resources/seed/source_definitions.yaml
Lines changed: 1 addition & 1 deletion

```diff
@@ -1991,7 +1991,7 @@
 - name: Zendesk Support
   sourceDefinitionId: 79c1aa37-dae3-42ae-b333-d1c105477715
   dockerRepository: airbyte/source-zendesk-support
-  dockerImageTag: 0.2.19
+  dockerImageTag: 0.2.20
   documentationUrl: https://docs.airbyte.com/integrations/sources/zendesk-support
   icon: zendesk-support.svg
   sourceType: api
```

airbyte-config/init/src/main/resources/seed/source_specs.yaml
Lines changed: 7 additions & 1 deletion

```diff
@@ -16227,7 +16227,7 @@
         path_in_connector_config:
           - "credentials"
           - "client_secret"
-- dockerImage: "airbyte/source-zendesk-support:0.2.19"
+- dockerImage: "airbyte/source-zendesk-support:0.2.20"
   spec:
     documentationUrl: "https://docs.airbyte.com/integrations/sources/zendesk-support"
     connectionSpecification:
@@ -16299,6 +16299,12 @@
           https://docs.airbyte.com/integrations/sources/zendesk-support#setup-guide\"\
           >docs</a> for more information."
         airbyte_secret: true
+      ignore_pagination:
+        type: "boolean"
+        default: false
+        description: "Makes each stream read a single page of data."
+        title: "Should the connector read the second and further pages of data."
+        airbyte_hidden: true
   supportsNormalization: false
   supportsDBT: false
   supported_destination_sync_modes: []
```

airbyte-integrations/connectors/source-zendesk-support/Dockerfile
Lines changed: 1 addition & 1 deletion

```diff
@@ -25,5 +25,5 @@ COPY source_zendesk_support ./source_zendesk_support
 ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
 ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]

-LABEL io.airbyte.version=0.2.19
+LABEL io.airbyte.version=0.2.20
 LABEL io.airbyte.name=airbyte/source-zendesk-support
```

airbyte-integrations/connectors/source-zendesk-support/integration_tests/expected_records.jsonl

Lines changed: 117 additions & 116 deletions
Large diffs are not rendered by default.

airbyte-integrations/connectors/source-zendesk-support/source_zendesk_support/source.py
Lines changed: 1 addition & 0 deletions

```diff
@@ -101,6 +101,7 @@ def convert_config2stream_args(cls, config: Mapping[str, Any]) -> Mapping[str, A
             "subdomain": config["subdomain"],
             "start_date": config["start_date"],
             "authenticator": cls.get_authenticator(config),
+            "ignore_pagination": config.get("ignore_pagination", False),
         }

     def streams(self, config: Mapping[str, Any]) -> List[Stream]:
```
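The new key is optional, so existing configs are unaffected; a minimal standalone sketch of the mapping (hypothetical helper mirroring `convert_config2stream_args`, with the authenticator omitted):

```python
from typing import Any, Mapping


def config_to_stream_args(config: Mapping[str, Any]) -> Mapping[str, Any]:
    # Hypothetical simplification of convert_config2stream_args:
    # "ignore_pagination" is read with .get() so it defaults to False
    # for configs created before this option existed.
    return {
        "subdomain": config["subdomain"],
        "start_date": config["start_date"],
        "ignore_pagination": config.get("ignore_pagination", False),
    }
```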

airbyte-integrations/connectors/source-zendesk-support/source_zendesk_support/spec.json
Lines changed: 7 additions & 0 deletions

```diff
@@ -68,6 +68,13 @@
           }
         }
       ]
+    },
+    "ignore_pagination": {
+      "type": "boolean",
+      "default": false,
+      "description": "Makes each stream read a single page of data.",
+      "title": "Should the connector read the second and further pages of data.",
+      "airbyte_hidden": true
     }
   }
 },
```

airbyte-integrations/connectors/source-zendesk-support/source_zendesk_support/streams.py
Lines changed: 73 additions & 11 deletions
```diff
@@ -3,12 +3,14 @@
 #

 import calendar
+import functools
+import logging
 import re
 import time
 from abc import ABC
 from collections import deque
 from concurrent.futures import Future, ProcessPoolExecutor
-from datetime import datetime
+from datetime import datetime, timedelta
 from functools import partial
 from math import ceil
 from pickle import PickleError, dumps
@@ -31,6 +33,33 @@
 LAST_END_TIME_KEY: str = "_last_end_time"
 END_OF_STREAM_KEY: str = "end_of_stream"

+logger = logging.getLogger("airbyte")
+
+# For some streams, multiple http requests are running at the same time for performance reasons.
+# However, it may result in hitting the rate limit, therefore subsequent requests have to be made after a pause.
+# The idea is to sustain a pause once and continue making multiple requests at a time.
+# A single `retry_at` variable is introduced here, which prevents us from duplicate sleeping in the main thread
+# before each request is made as it used to be in prior versions.
+# It acts like a global counter - increased each time a 429 status is met
+# only if it is greater than the current value. On the other hand, no request may be made before this moment.
+# Because the requests are made in parallel, time.sleep will be called in parallel as well.
+# This is possible because it is a point in time, not timedelta.
+retry_at: Optional[datetime] = None
+
+
+def sleep_before_executing(sleep_time: float):
+    def wrapper(function):
+        @functools.wraps(function)
+        def inner(*args, **kwargs):
+            logger.info(f"Sleeping {sleep_time} seconds before next request")
+            time.sleep(int(sleep_time))
+            result = function(*args, **kwargs)
+            return result, datetime.utcnow()
+
+        return inner
+
+    return wrapper
+

 def to_int(s):
     "https://github.com/airbytehq/airbyte/issues/13673"
```
```diff
@@ -60,10 +89,16 @@ def send_future(self, request: requests.PreparedRequest, **kwargs) -> Future:
         if self.session:
             func = self.session.send
         else:
+            sleep_time = 0
+            now = datetime.utcnow()
+            if retry_at and retry_at > now:
+                sleep_time = (retry_at - datetime.utcnow()).seconds
             # avoid calling super to not break pickled method
             func = partial(requests.Session.send, self)
+            func = sleep_before_executing(sleep_time)(func)

         if isinstance(self.executor, ProcessPoolExecutor):
+            self.logger.warning("ProcessPoolExecutor is used to perform IO related tasks for unknown reason!")
             # verify function can be pickled
             try:
                 dumps(func)
```
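The pause each worker takes is just the distance from now to the shared `retry_at` deadline, clamped at zero; a sketch of that computation (hypothetical helper, the connector inlines it in `send_future`):

```python
from datetime import datetime, timedelta
from typing import Optional


def seconds_until(retry_at: Optional[datetime], now: datetime) -> int:
    # A deadline in the past (or no deadline at all) means no sleep is
    # needed; otherwise sleep the remaining whole seconds.
    if retry_at and retry_at > now:
        return (retry_at - now).seconds
    return 0
```

Note that `timedelta.seconds` is the whole-second component of the delta, which suffices here because the deadline is always well under a day away.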
```diff
@@ -74,11 +109,12 @@ def send_future(self, request: requests.PreparedRequest, **kwargs) -> Future:
 class BaseSourceZendeskSupportStream(HttpStream, ABC):
-    def __init__(self, subdomain: str, start_date: str, **kwargs):
+    def __init__(self, subdomain: str, start_date: str, ignore_pagination: bool = False, **kwargs):
         super().__init__(**kwargs)

         self._start_date = start_date
         self._subdomain = subdomain
+        self._ignore_pagination = ignore_pagination

     def backoff_time(self, response: requests.Response) -> Union[int, float]:
         """
@@ -93,10 +129,9 @@ def backoff_time(self, response: requests.Response) -> Union[int, float]:
             return retry_after

         # the header X-Rate-Limit returns the amount of requests per minute
-        # we try to wait twice as long
         rate_limit = float(response.headers.get("X-Rate-Limit", 0))
         if rate_limit and rate_limit > 0:
-            return (60.0 / rate_limit) * 2
+            return 60.0 / rate_limit
         return super().backoff_time(response)

     @staticmethod
```
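With `X-Rate-Limit` reporting allowed requests per minute, evenly spacing requests means pausing `60 / rate_limit` seconds; the previous doubling is dropped. A standalone sketch of the new computation (the CDK fallback is replaced here by a plain default argument):

```python
def backoff_from_rate_limit(headers: dict, default: float = 60.0) -> float:
    # X-Rate-Limit is requests allowed per minute, so the even-spacing
    # pause between requests is 60 / rate_limit seconds.
    # Prior versions returned (60.0 / rate_limit) * 2.
    rate_limit = float(headers.get("X-Rate-Limit", 0))
    if rate_limit > 0:
        return 60.0 / rate_limit
    return default
```

For example, a limit of 10 requests per minute now yields a 6-second pause instead of 12.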
```diff
@@ -211,7 +246,7 @@ def generate_future_requests(
         stream_state: Mapping[str, Any] = None,
     ):
         records_count = self.get_api_records_count(stream_slice=stream_slice, stream_state=stream_state)
-
+        self.logger.info(f"Records count is {records_count}")
         page_count = ceil(records_count / self.page_size)
         for page_number in range(1, page_count + 1):
             params = self.request_params(stream_state=stream_state, stream_slice=stream_slice)
@@ -228,8 +263,14 @@

             request_kwargs = self.request_kwargs(stream_state=stream_state, stream_slice=stream_slice)
             self.future_requests.append(
-                {"future": self._send_request(request, request_kwargs), "request": request, "request_kwargs": request_kwargs, "retries": 0}
+                {
+                    "future": self._send_request(request, request_kwargs),
+                    "request": request,
+                    "request_kwargs": request_kwargs,
+                    "retries": 0,
+                }
             )
+        self.logger.info(f"Generated {len(self.future_requests)} future requests")

     def _send(self, request: requests.PreparedRequest, request_kwargs: Mapping[str, Any]) -> Future:
         response: Future = self._session.send_future(request, **request_kwargs)
```
```diff
@@ -264,15 +305,20 @@ def _retry(
         retries: int,
         original_exception: Exception = None,
         response: requests.Response = None,
+        finished_at: Optional[datetime] = None,
         **request_kwargs,
     ):
         if retries == self.max_retries:
             if original_exception:
                 raise original_exception
             raise DefaultBackoffException(request=request, response=response)
-        if response is not None:
-            backoff_time = self.backoff_time(response)
-            time.sleep(max(0, int(backoff_time - response.elapsed.total_seconds())))
+        sleep_time = self.backoff_time(response)
+        if response is not None and finished_at and sleep_time:
+            current_retry_at = finished_at + timedelta(seconds=sleep_time)
+            global retry_at
+            if not retry_at or (retry_at < current_retry_at):
+                retry_at = current_retry_at
+        self.logger.info(f"Adding a request to be retried in {sleep_time} seconds")
         self.future_requests.append(
             {
                 "future": self._send_request(request, request_kwargs),
```
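The deadline bookkeeping can be isolated into a small helper: each 429 proposes `finished_at + backoff` as a new wall-clock deadline, and the shared value only ever moves forward. A sketch (hypothetical function; the connector inlines this logic in `_retry`):

```python
from datetime import datetime, timedelta
from typing import Optional

retry_at: Optional[datetime] = None  # shared across all in-flight requests


def propose_retry_deadline(finished_at: datetime, sleep_time: float) -> datetime:
    # Advance the shared deadline only if the candidate is later, so
    # concurrent 429s never move it backwards; workers then all sleep
    # until the same point in time instead of stacking timedeltas.
    global retry_at
    candidate = finished_at + timedelta(seconds=sleep_time)
    if retry_at is None or retry_at < candidate:
        retry_at = candidate
    return retry_at
```

Because the deadline is a point in time rather than a duration, many workers can sleep toward it in parallel without multiplying the total wait.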
```diff
@@ -292,17 +338,21 @@ def read_records(
         self.generate_future_requests(sync_mode=sync_mode, cursor_field=cursor_field, stream_slice=stream_slice, stream_state=stream_state)

         while len(self.future_requests) > 0:
+            self.logger.info("Starting another while loop iteration")
             item = self.future_requests.popleft()
             request, retries, future, kwargs = item["request"], item["retries"], item["future"], item["request_kwargs"]

             try:
-                response = future.result()
+                response, finished_at = future.result()
             except TRANSIENT_EXCEPTIONS as exc:
+                self.logger.info("Will retry the request because of a transient exception")
                 self._retry(request=request, retries=retries, original_exception=exc, **kwargs)
                 continue
             if self.should_retry(response):
-                self._retry(request=request, retries=retries, response=response, **kwargs)
+                self.logger.info("Will retry the request for other reason")
+                self._retry(request=request, retries=retries, response=response, finished_at=finished_at, **kwargs)
                 continue
+            self.logger.info("Request successful, will parse the response now")
             yield from self.parse_response(response, stream_state=stream_state, stream_slice=stream_slice)
```
```diff
@@ -324,6 +374,8 @@ def path(self, **kwargs):
         return self.name

     def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        if self._ignore_pagination:
+            return None
         next_page = self._parse_next_page_number(response)
         if not next_page:
             self._finished = True
@@ -357,6 +409,8 @@ def get_updated_state(self, current_stream_state: MutableMapping[str, Any], late
         return {self.cursor_field: max(new_value, old_value)}

     def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        if self._ignore_pagination:
+            return None
         start_time = dict(parse_qsl(urlparse(response.json().get(self.next_page_field), "").query)).get("start_time")
         if start_time != self.prev_start_time:
             self.prev_start_time = start_time
@@ -502,6 +556,8 @@ class GroupMemberships(SourceZendeskSupportCursorPaginationStream):
     """GroupMemberships stream: https://developer.zendesk.com/api-reference/ticketing/groups/group_memberships/"""

     def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        if self._ignore_pagination:
+            return None
         next_page = self._parse_next_page_number(response)
         return next_page if next_page else None

@@ -522,6 +578,8 @@ class SatisfactionRatings(SourceZendeskSupportCursorPaginationStream):
     """

     def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        if self._ignore_pagination:
+            return None
         next_page = self._parse_next_page_number(response)
         return next_page if next_page else None

@@ -548,6 +606,8 @@ class TicketMetrics(SourceZendeskSupportCursorPaginationStream):
     """TicketMetric stream: https://developer.zendesk.com/api-reference/ticketing/tickets/ticket_metrics/"""

     def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        if self._ignore_pagination:
+            return None
         next_page = self._parse_next_page_number(response)
         return next_page if next_page else None

@@ -601,6 +661,8 @@ def request_params(self, next_page_token: Mapping[str, Any] = None, **kwargs) ->
         return params

     def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
+        if self._ignore_pagination:
+            return None
         return response.json().get("before_cursor")
```
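Every override follows the same two-line guard; a minimal sketch of the pattern with the response parsing replaced by a hypothetical dict lookup:

```python
from typing import Any, Mapping, Optional


class PaginatedStreamSketch:
    # Stand-in for the connector's streams: returning None from
    # next_page_token tells the caller there are no more pages, so
    # setting ignore_pagination makes a stream read exactly one page.
    def __init__(self, ignore_pagination: bool = False):
        self._ignore_pagination = ignore_pagination

    def next_page_token(self, payload: Mapping[str, Any]) -> Optional[int]:
        if self._ignore_pagination:
            return None
        return payload.get("next_page")  # hypothetical pagination field
```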
airbyte-integrations/connectors/source-zendesk-support/unit_tests/test_futures.py
Lines changed: 49 additions & 6 deletions

```diff
@@ -2,6 +2,7 @@
 # Copyright (c) 2022 Airbyte, Inc., all rights reserved.
 #

+import datetime
 import json
 from datetime import timedelta
 from urllib.parse import urljoin
@@ -14,7 +15,7 @@
 from airbyte_cdk.sources.streams.http.exceptions import DefaultBackoffException
 from requests.exceptions import ConnectionError
 from source_zendesk_support.source import BasicApiTokenAuthenticator
-from source_zendesk_support.streams import Macros
+from source_zendesk_support.streams import Macros, Organizations

 STREAM_ARGS: dict = {
     "subdomain": "fake-subdomain",
@@ -23,7 +24,7 @@
 }


-@pytest.fixture(autouse=True)
+@pytest.fixture()
 def time_sleep_mock(mocker):
     time_mock = mocker.patch("time.sleep", lambda x: None)
     yield time_mock
@@ -39,7 +40,7 @@ def time_sleep_mock(mocker):
         (101, 100, 2),
     ],
 )
-def test_proper_number_of_future_requests_generated(records_count, page_size, expected_futures_deque_len):
+def test_proper_number_of_future_requests_generated(records_count, page_size, expected_futures_deque_len, time_sleep_mock):
     stream = Macros(**STREAM_ARGS)
     stream.page_size = page_size

@@ -60,7 +61,7 @@ def test_proper_number_of_future_requests_generated(records_count, page_size, ex
         (10, 10, 0),
     ],
 )
-def test_parse_future_records(records_count, page_size, expected_futures_deque_len):
+def test_parse_future_records(records_count, page_size, expected_futures_deque_len, time_sleep_mock):
     stream = Macros(**STREAM_ARGS)
     stream.page_size = page_size
     expected_records = [
@@ -82,7 +83,7 @@ def test_parse_future_records(records_count, page_size, expected_futures_deque_l
     if not stream.future_requests and not expected_futures_deque_len:
         assert len(stream.future_requests) == 0 and not expected_records
     else:
-        response = stream.future_requests[0]["future"].result()
+        response, _ = stream.future_requests[0]["future"].result()
         records = list(stream.parse_response(response, stream_state=None, stream_slice=None))
         assert records == expected_records

@@ -97,7 +98,7 @@ def test_parse_future_records(records_count, page_size, expected_futures_deque_l
         (101, 101, 2, None),
     ],
 )
-def test_read_records(mocker, records_count, page_size, expected_futures_deque_len, expected_exception):
+def test_read_records(mocker, records_count, page_size, expected_futures_deque_len, expected_exception, time_sleep_mock):
     stream = Macros(**STREAM_ARGS)
     stream.page_size = page_size
     should_retry = bool(expected_exception)
@@ -131,3 +132,45 @@ def record_gen(start=0, end=page_size):
         list(stream.read_records(sync_mode=SyncMode.full_refresh))
     else:
         assert list(stream.read_records(sync_mode=SyncMode.full_refresh)) == list(record_gen(end=expected_records_count))
+
+
+def test_sleep_time():
+    page_size = 100
+    x_rate_limit = 10
+    records_count = 350
+    pages = 4
+
+    start = datetime.datetime.now()
+    stream = Organizations(**STREAM_ARGS)
+    stream.page_size = page_size
+
+    def record_gen(start=0, end=100):
+        for i in range(start, end):
+            yield {f"key{i}": f"val{i}", stream.cursor_field: (pendulum.parse("2020-01-01") + timedelta(days=i)).isoformat()}
+
+    with requests_mock.Mocker() as m:
+        count_url = urljoin(stream.url_base, f"{stream.path()}/count.json")
+        m.get(count_url, text=json.dumps({"count": {"value": records_count}}))
+
+        records_url = urljoin(stream.url_base, stream.path())
+        responses = [
+            {
+                "status_code": 429,
+                "headers": {"X-Rate-Limit": str(x_rate_limit)},
+                "text": "{}"
+            }
+            for _ in range(pages)
+        ] + [
+            {
+                "status_code": 200,
+                "headers": {},
+                "text": json.dumps({"organizations": list(record_gen(page * page_size, min(records_count, (page + 1) * page_size)))})
+            }
+            for page in range(pages)
+        ]
+        m.get(records_url, responses)
+        records = list(stream.read_records(sync_mode=SyncMode.full_refresh))
+        assert len(records) == records_count
+    end = datetime.datetime.now()
+    sleep_time = int(60 / x_rate_limit)
+    assert sleep_time - 1 <= (end - start).seconds <= sleep_time + 1
```

docs/integrations/sources/zendesk-support.md
Lines changed: 3 additions & 2 deletions

```diff
@@ -59,8 +59,9 @@ The Zendesk connector ideally should not run into Zendesk API limitations under
 ## Changelog

 | Version | Date | Pull Request | Subject |
-|:---------|:-----------| :--------------------------------------------------------|:----------------------------------------------------------------|
-| `0.2.19` | 2022-12-09 | [19967](https://github.com/airbytehq/airbyte/pull/19967) | Fix reading response for more than 100k records |
+|:---------|:-----------|:---------------------------------------------------------|:----------------------------------------------------------------|
+| `0.2.20` | 2022-12-28 | [20900](https://github.com/airbytehq/airbyte/pull/20900) | Remove synchronous time.sleep, add logging, reduce backoff time |
+| `0.2.19` | 2022-12-09 | [19967](https://github.com/airbytehq/airbyte/pull/19967) | Fix reading response for more than 100k records |
 | `0.2.18` | 2022-11-29 | [19432](https://github.com/airbytehq/airbyte/pull/19432) | Revert changes from version 0.2.15, use a test read instead |
 | `0.2.17` | 2022-11-24 | [19792](https://github.com/airbytehq/airbyte/pull/19792) | Transform `ticket_comments.via` "-" to null |
 | `0.2.16` | 2022-09-28 | [17326](https://github.com/airbytehq/airbyte/pull/17326) | Migrate to per-stream states. |
```
