|
|
| Created: 11 years, 3 months ago by mruthven Modified: 10 years, 7 months ago Base URL: https://github.com/maryruthven/web-page-replay.git@master Visibility: Public. | Descriptionremoving hacks removing old wrap removing whit space wpr hacks BUG= Patch Set 1 #Patch Set 2 : reverting files #Patch Set 3 : httpclient #Patch Set 4 : reverting deterministic #Patch Set 5 : remove path_for_matching #Patch Set 6 : fixing httparchive tostring #Patch Set 7 : putting path_for_matching back in #Patch Set 8 : removing whitespace #Patch Set 9 : syncing src #Patch Set 10 : style #Patch Set 11 : style #Patch Set 12 : removing weird log #Patch Set 13 : modify rules #Patch Set 14 : Adding path_for_matching to generalize paths #Patch Set 15 : removing some logging #Patch Set 16 : generalize paths #Patch Set 17 : Removing extra whitespace #Patch Set 18 : adding flag for rules #Patch Set 19 : #Patch Set 20 : #Patch Set 21 : Load json rules from file #Patch Set 22 : modifying rules #Patch Set 23 : #Patch Set 24 : # Total comments: 60 Patch Set 25 : #Patch Set 26 : #Patch Set 27 : #Patch Set 28 : #Patch Set 29 : adding comments #Patch Set 30 : #Patch Set 31 : Fixing parsing #Patch Set 32 : modifying rules #Patch Set 33 : requestMatches -> urlMatches #Patch Set 34 : adding isinstance checks # Total comments: 38 Patch Set 35 : Changing instance checks and rules #Patch Set 36 : fixing errors # Total comments: 12 Patch Set 37 : removing rules # Total comments: 28 Patch Set 38 : modifying the callback rules #Patch Set 39 : Fix callback replacement # Total comments: 1 Patch Set 40 : add TODO #Patch Set 41 : remove log #Patch Set 42 : change names #Patch Set 43 : #Patch Set 44 : style #
MessagesTotal messages: 17
Add rules to modify request paths and responses to get around certain archive lookup errors. It modifies how requests are stored in the archive by adding a path_for_matching. Adds a way to designate parameters in the lookup url to remove that may change between record replay. Declare paths that should send 204s Declare headers to remove for specific paths. Sign in to reply to this message.
Is there a document that describes this change? I want to think a bit about the implementation approach, and I want to make sure I understand the intent of the change. https://codereview.appspot.com/114560043/diff/400001/certutils.py File certutils.py (right): https://codereview.appspot.com/114560043/diff/400001/certutils.py#newcode97 certutils.py:97: These edits are already in the tree. Would you update this review to use the correct base? https://codereview.appspot.com/114560043/diff/400001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/400001/httparchive.py#newcode477 httparchive.py:477: is_ssl: a boolean which is True if request is make via SSL. (iff is short of if-and-only-if.) This can be shortened to the following: is_ssl: True if request is made via SSL. https://codereview.appspot.com/114560043/diff/400001/httparchive.py#newcode477 httparchive.py:477: is_ssl: a boolean which is True if request is make via SSL. path_for_matching and more_undesireable_keys need documentation. https://codereview.appspot.com/114560043/diff/400001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode445 httpclient.py:445: if re.match(r'%s' % callback_path, '%s%s' % (request.host, The r'' syntax is only needed to avoid interpreting backslashes in string literals. In this case, the following is equivalent: if re.match(callback_path, '%s%s' % (request.host, ... https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode450 httpclient.py:450: newkey = request.full_path.rsplit('callback=_xdc_._', 1)[1] Is "callback=_xdc_._" supposed to be specified by the callback_path? https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode473 httpclient.py:473: use_record_mode: If True, start in server in record mode. Add an entry for "rules". Sign in to reply to this message.
Re: the design, see my notes in: https://docs.google.com/a/google.com/document/d/19ACPb_bS7W691IlXiqlksfLukhbv... https://codereview.appspot.com/114560043/diff/400001/certutils.py File certutils.py (right): https://codereview.appspot.com/114560043/diff/400001/certutils.py#newcode149 certutils.py:149: if len(host_certs) > 0: Nit: prefer successful return at end, i.e.: if not host_certs: logger.warning... return '' return _dump_... https://codereview.appspot.com/114560043/diff/400001/certutils.py#newcode151 certutils.py:151: logging.warning('Did not get SNI from server') log "from " host:port instead of "from server" https://codereview.appspot.com/114560043/diff/400001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/400001/httparchive.py#newcode489 httparchive.py:489: path_for_matching if path_for_matching else full_path) = (path_for_matching or full_path) https://codereview.appspot.com/114560043/diff/400001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode445 httpclient.py:445: if re.match(r'%s' % callback_path, '%s%s' % (request.host, see below re: re.compile in parse_rules https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode450 httpclient.py:450: newkey = request.full_path.rsplit('callback=_xdc_._', 1)[1] On 2014/08/06 21:58:54, slamm wrote: > Is "callback=_xdc_._" supposed to be specified by the callback_path? This is admittedly not fully generic yet; we (I?) plan to refactor this in a follow-up CL. https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode494 httpclient.py:494: if rules: assume rules != None, so don't need this "if rules" check https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode498 httpclient.py:498: host, paths = url allow url as single str? https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode498 httpclient.py:498: host, paths = url add check for: - isinstance(url, list) - isinstance(host, str) - isinstance(paths, list) where each path is str that startswith "/" https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode500 httpclient.py:500: callback_paths.add('%s%s' % (host, path)) re.compile the regex here. https://codereview.appspot.com/114560043/diff/400001/httpproxy.py File httpproxy.py (right): https://codereview.appspot.com/114560043/diff/400001/httpproxy.py#newcode106 httpproxy.py:106: logging.error('undesirable path %s %s', parsed.path, full_path) logger.debug? https://codereview.appspot.com/114560043/diff/400001/httpproxy.py#newcode299 httpproxy.py:299: def parse_rules(self, rules): as in httpclient parse_rules, add isinstance checks. https://codereview.appspot.com/114560043/diff/400001/httpproxy.py#newcode305 httpproxy.py:305: for rule, url, action in rules: rename "rule" to "predicate", so we can say "rules is a list of rule", where each rule is a list of: predicate, predicate_args, action [,action_args] https://codereview.appspot.com/114560043/diff/400001/httpproxy.py#newcode314 httpproxy.py:314: assert path.count('(') == path.count(')') Don't do this '()' tokenizing here, esp. because it'll die on (e.g.) "foo((bar))", "foo\(x", etc. Ideally re.compile the host+path here and use "start/end/span" in "get_archived_http_request", e.g.: p = re.compile("foo.com/(bar/)qux") s = "foo.com/bar/qux" m = p.match(s) assert p.groups() == 1 assert m.span(1) == (8, 12) assert s[:8]+s[12:] == "foo.com/qux" https://codereview.appspot.com/114560043/diff/400001/json_rules.txt File json_rules.txt (right): https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode2 json_rules.txt:2: ["isFetchPath", ["mts0.google.com", ["/vt.*callback.*", "/maps/vt.*callback.*"]], mts0\.google\.com https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode8 json_rules.txt:8: ["isRequestPath", ["maps.gstatic.com", ["(.*)/mapfiles/transparent.png", "/mapfiles(.*)/shadow50.png", /(.*/)mapfiles/transparent.png i.e. all paths should start with '/' -- ideally the rule parser(s) will check this. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode15 json_rules.txt:15: ["isRequestPath", [".*.metric.gstatic.com", ["/gen_204.*"]], "[^\/?&]*\.metric\.gstatic\.com" (or equiv -- i.e. don't want to match "foo.com/bar/metric.gstatic.com/gen204" https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode25 json_rules.txt:25: ["isRequestPath", ["/maps/preview/log204", "cache-control"], should be [host, [paths]] https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode26 json_rules.txt:26: "removeHeader"] change to "removeHeader", ["cache-control"] ? https://codereview.appspot.com/114560043/diff/400001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/400001/replay.py#newcode136 replay.py:136: json_rules = json.load(json_file) For now there's no way to detect misspelled/unknown predicates/actions, e.g. "isBlah". Let's do that in a follow-up CL, e.g. via an optional json schema. This somewhat makes the above error checking redundant, but I think both are valid, i.e.: - the above instanceof verifies the input is well formed. - the optional schema verifies that the specific pred/actions are supported. https://codereview.appspot.com/114560043/diff/400001/replay.py#newcode136 replay.py:136: json_rules = json.load(json_file) add type check for json_rules: - rules must be a isinstance(json_rules, list) - each rule must be a list of len 3..4, of types [str, list of strs, str, [,list of strs]] Special case: json lacks a comment operator, so also allow ["_comment", ...] per: http://stackoverflow.com/questions/244777/can-i-comment-a-json-file e.g. [["_comment", "blah blah", "whatever", 1234], ["isFetchRule", [], "doStuff"], ["_comment", "yada3"] ] Sign in to reply to this message.
more rule-language tweaking https://codereview.appspot.com/114560043/diff/400001/json_rules.txt File json_rules.txt (right): https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode1 json_rules.txt:1: [ as noted in replay.py, allow "_comment" and illustrate one here, e.g.: ["_comment", "Example rules for google search, bing search, and maps"], maybe also w/ link to our TBD docs. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode2 json_rules.txt:2: ["isFetchPath", ["mts0.google.com", ["/vt.*callback.*", "/maps/vt.*callback.*"]], This [host, [paths...]] seems potentially confusing. Instead of: ['isFetchPath', ['foo.com', ['/bar', '/qux']], 'replaceCallback'] maybe we should do: ['isFetchPath', ['foo.com/bar'], 'replaceCallback'] ['isFetchPath', ['foo.com/qux'], 'replaceCallback'] or: ['isFetchPath', ['foo.com/bar', 'foo.com/qux'], 'replaceCallback'] One upshot of the latter is that we could do: ['isFetchPath', ['foo.com/bar', 'foo.com/qux', 'bar.com/', ...], 'replaceCallback'] which'd group all the tests into a big "OR" with a shared action. (in this particular case, you could do "mts0.google.com/(maps/)?vt.*callback.*") https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode2 json_rules.txt:2: ["isFetchPath", ["mts0.google.com", ["/vt.*callback.*", "/maps/vt.*callback.*"]], rename "isFetchPath" to something else... TBD. let's discuss offline. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode4 json_rules.txt:4: ["isRequestPath", ["gg.google.com", ["/csi.*jsv(.*)"]], rename "isRequestPath" to "isRequestURL" or "urlMatches"? the latter sounds nice... https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode5 json_rules.txt:5: "generalizePath"], rename "generalizePath" to "trimURL" or "removeGroupsFromURL"? prefer "URL" b/c input could be "(.*)\.foo\.com". maybe put the word "archive" in there somewhere, b/c we're not actually changing the URL that we send upstream; we're changing the "matching path" for archive lookups. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode14 json_rules.txt:14: "send204"], maybe change "send204" to "sendStatus", ["204"] or "sendStatus", [204] ? It's a tradeoff between generality vs simplicity, 'though (in this case) I kinda like the brevity of "send204", so I'm ambivalent. Sign in to reply to this message.
https://codereview.appspot.com/114560043/diff/400001/certutils.py File certutils.py (right): https://codereview.appspot.com/114560043/diff/400001/certutils.py#newcode97 certutils.py:97: On 2014/08/06 21:58:54, slamm wrote: > These edits are already in the tree. Would you update this review to use the > correct base? Done. https://codereview.appspot.com/114560043/diff/400001/certutils.py#newcode97 certutils.py:97: On 2014/08/06 21:58:54, slamm wrote: > These edits are already in the tree. Would you update this review to use the > correct base? Done. https://codereview.appspot.com/114560043/diff/400001/certutils.py#newcode149 certutils.py:149: if len(host_certs) > 0: On 2014/08/07 20:08:40, wrightt wrote: > Nit: prefer successful return at end, i.e.: > if not host_certs: > logger.warning... > return '' > return _dump_... Done. https://codereview.appspot.com/114560043/diff/400001/certutils.py#newcode151 certutils.py:151: logging.warning('Did not get SNI from server') On 2014/08/07 20:08:40, wrightt wrote: > log "from " host:port instead of "from server" Done. https://codereview.appspot.com/114560043/diff/400001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/400001/httparchive.py#newcode477 httparchive.py:477: is_ssl: a boolean which is True if request is make via SSL. On 2014/08/06 21:58:54, slamm wrote: > (iff is short of if-and-only-if.) > > This can be shortened to the following: > > is_ssl: True if request is made via SSL. Done. https://codereview.appspot.com/114560043/diff/400001/httparchive.py#newcode477 httparchive.py:477: is_ssl: a boolean which is True if request is make via SSL. On 2014/08/06 21:58:54, slamm wrote: > path_for_matching and more_undesireable_keys need documentation. Done. https://codereview.appspot.com/114560043/diff/400001/httparchive.py#newcode489 httparchive.py:489: path_for_matching if path_for_matching else full_path) On 2014/08/07 20:08:40, wrightt wrote: > = (path_for_matching or full_path) Done. https://codereview.appspot.com/114560043/diff/400001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode445 httpclient.py:445: if re.match(r'%s' % callback_path, '%s%s' % (request.host, On 2014/08/06 21:58:54, slamm wrote: > The r'' syntax is only needed to avoid interpreting backslashes in string > literals. > In this case, the following is equivalent: > > if re.match(callback_path, '%s%s' % (request.host, ... Done. https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode445 httpclient.py:445: if re.match(r'%s' % callback_path, '%s%s' % (request.host, On 2014/08/06 21:58:54, slamm wrote: > The r'' syntax is only needed to avoid interpreting backslashes in string > literals. > In this case, the following is equivalent: > > if re.match(callback_path, '%s%s' % (request.host, ... Done. https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode473 httpclient.py:473: use_record_mode: If True, start in server in record mode. On 2014/08/06 21:58:54, slamm wrote: > Add an entry for "rules". Done. https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode494 httpclient.py:494: if rules: On 2014/08/07 20:08:40, wrightt wrote: > assume rules != None, so don't need this "if rules" check Done. https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode498 httpclient.py:498: host, paths = url On 2014/08/07 20:08:40, wrightt wrote: > allow url as single str? Done. https://codereview.appspot.com/114560043/diff/400001/httpclient.py#newcode500 httpclient.py:500: callback_paths.add('%s%s' % (host, path)) On 2014/08/07 20:08:40, wrightt wrote: > re.compile the regex here. Done. https://codereview.appspot.com/114560043/diff/400001/httpproxy.py File httpproxy.py (right): https://codereview.appspot.com/114560043/diff/400001/httpproxy.py#newcode106 httpproxy.py:106: logging.error('undesirable path %s %s', parsed.path, full_path) On 2014/08/07 20:08:41, wrightt wrote: > logger.debug? Acknowledged. https://codereview.appspot.com/114560043/diff/400001/httpproxy.py#newcode305 httpproxy.py:305: for rule, url, action in rules: On 2014/08/07 20:08:41, wrightt wrote: > rename "rule" to "predicate", so we can say "rules is a list of rule", where > each rule is a list of: > predicate, predicate_args, action [,action_args] Done. https://codereview.appspot.com/114560043/diff/400001/httpproxy.py#newcode314 httpproxy.py:314: assert path.count('(') == path.count(')') On 2014/08/07 20:08:41, wrightt wrote: > Don't do this '()' tokenizing here, esp. because it'll die on (e.g.) > "foo((bar))", "foo\(x", etc. > > Ideally re.compile the host+path here and use "start/end/span" in > "get_archived_http_request", e.g.: > p = re.compile("foo.com/(bar/)qux") > s = "foo.com/bar/qux" > m = p.match(s) > assert p.groups() == 1 > assert m.span(1) == (8, 12) > assert s[:8]+s[12:] == "foo.com/qux" Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt File json_rules.txt (right): https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode2 json_rules.txt:2: ["isFetchPath", ["mts0.google.com", ["/vt.*callback.*", "/maps/vt.*callback.*"]], On 2014/08/07 20:08:41, wrightt wrote: > http://mts0%5C.google%5C.com Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode2 json_rules.txt:2: ["isFetchPath", ["mts0.google.com", ["/vt.*callback.*", "/maps/vt.*callback.*"]], On 2014/08/07 22:59:23, wrightt wrote: > rename "isFetchPath" to something else... TBD. let's discuss offline. Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode2 json_rules.txt:2: ["isFetchPath", ["mts0.google.com", ["/vt.*callback.*", "/maps/vt.*callback.*"]], On 2014/08/07 22:59:23, wrightt wrote: > This [host, [paths...]] seems potentially confusing. > > Instead of: > ['isFetchPath', ['foo.com', ['/bar', '/qux']], 'replaceCallback'] > maybe we should do: > ['isFetchPath', ['foo.com/bar'], 'replaceCallback'] > ['isFetchPath', ['foo.com/qux'], 'replaceCallback'] > or: > ['isFetchPath', ['foo.com/bar', 'foo.com/qux'], 'replaceCallback'] > > One upshot of the latter is that we could do: > ['isFetchPath', ['foo.com/bar', 'foo.com/qux', 'bar.com/', ...], > 'replaceCallback'] > which'd group all the tests into a big "OR" with a shared action. > > (in this particular case, you could do "mts0.google.com/(maps/)?vt.*callback.*") Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode4 json_rules.txt:4: ["isRequestPath", ["gg.google.com", ["/csi.*jsv(.*)"]], On 2014/08/07 22:59:23, wrightt wrote: > rename "isRequestPath" to "isRequestURL" or "urlMatches"? the latter sounds > nice... Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode5 json_rules.txt:5: "generalizePath"], On 2014/08/07 22:59:23, wrightt wrote: > rename "generalizePath" to "trimURL" or "removeGroupsFromURL"? > > prefer "URL" b/c input could be "(.*)\.foo\.com". > > maybe put the word "archive" in there somewhere, b/c we're not actually changing > the URL that we send upstream; we're changing the "matching path" for archive > lookups. Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode8 json_rules.txt:8: ["isRequestPath", ["maps.gstatic.com", ["(.*)/mapfiles/transparent.png", "/mapfiles(.*)/shadow50.png", Since it is now the whole path, this no longer applies On 2014/08/07 20:08:41, wrightt wrote: > /(.*/)mapfiles/transparent.png > > i.e. all paths should start with '/' -- ideally the rule parser(s) will check > this. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode14 json_rules.txt:14: "send204"], On 2014/08/07 22:59:23, wrightt wrote: > maybe change "send204" to > "sendStatus", ["204"] > or > "sendStatus", [204] > ? > It's a tradeoff between generality vs simplicity, 'though (in this case) I kinda > like the brevity of "send204", so I'm ambivalent. Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode15 json_rules.txt:15: ["isRequestPath", [".*.metric.gstatic.com", ["/gen_204.*"]], On 2014/08/07 20:08:41, wrightt wrote: > "[^\/?&]*\.metric\.gstatic\.com" > > (or equiv -- i.e. don't want to match "foo.com/bar/metric.gstatic.com/gen204" Done. https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode25 json_rules.txt:25: ["isRequestPath", ["/maps/preview/log204", "cache-control"], This one is different. It is supposed to be the headers to exclude for the url. On 2014/08/07 20:08:41, wrightt wrote: > should be [host, [paths]] https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode25 json_rules.txt:25: ["isRequestPath", ["/maps/preview/log204", "cache-control"], I cant find this path in our logs, so I dont know what the exact url was On 2014/08/07 20:08:41, wrightt wrote: > should be [host, [paths]] https://codereview.appspot.com/114560043/diff/400001/json_rules.txt#newcode26 json_rules.txt:26: "removeHeader"] On 2014/08/07 20:08:41, wrightt wrote: > change to "removeHeader", ["cache-control"] ? Done. https://codereview.appspot.com/114560043/diff/400001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/400001/replay.py#newcode136 replay.py:136: json_rules = json.load(json_file) On 2014/08/07 20:08:41, wrightt wrote: > add type check for json_rules: > - rules must be a isinstance(json_rules, list) > - each rule must be a list of len 3..4, of types [str, list of strs, str, [,list > of strs]] > > Special case: json lacks a comment operator, so also allow ["_comment", ...] > per: > http://stackoverflow.com/questions/244777/can-i-comment-a-json-file > e.g. > [["_comment", "blah blah", "whatever", 1234], > ["isFetchRule", [], "doStuff"], > ["_comment", "yada3"] > ] Done. Sign in to reply to this message.
looking good, notes below! https://codereview.appspot.com/114560043/diff/580001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/580001/httparchive.py#newcode467 httparchive.py:467: more_undesirable_keys=None): rename to "additional_undesirable_keys"? (to avoid confusion re: whether "more" is "more/less" degree as opposed to quantity) https://codereview.appspot.com/114560043/diff/580001/httparchive.py#newcode479 httparchive.py:479: more_undesirable_keys: A dictionary of headers to exclude. s/dictionary/list/ https://codereview.appspot.com/114560043/diff/580001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode267 httpclient.py:267: if ':' in request.host: Nit: simplify code & use split(.., 1) to prevent bogus "foo.com:123:456" input: host, port = request.host, None if ':' in host: parts = host.split(':', 1) host, port = parts[0], int(parts[1]) https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode275 httpclient.py:275: host_ip = self._real_dns_lookup(truehost) add comment re: why we're looking up the host_ip. If the host name usually works, would it make sense to do the retry loop once w/ the host name then switch to the host_ip on retry. We don't need to do this for this CL, but I think it's worth commenting in the code. https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode282 httpclient.py:282: if request.is_ssl: connection_class = (DetailedHTTPSConnection if request.is_ssl else DetailedHTTPConnection) connection = connection_cls(host_ip, port) https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode443 httpclient.py:443: def mutate_response(request, response, callback_paths): add a minimal doc, e.g.: """Modifies archived body's callback_id's with request's param.""" https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode495 httpclient.py:495: (predicate, predicate_args, action), action_args = rule[:3], rule[3:] in this specific case, don't need action_args, so can do: predicate, predicate_args, action = rule[:3] https://codereview.appspot.com/114560043/diff/580001/httpproxy.py File httpproxy.py (right): https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode95 httpproxy.py:95: for path in self.server.paths_to_generalize: rename "paths_to_generalize" to "paths_to_remove_groups_from" (ugh) or "paths_to_edit"? (b/c we renamed the rule from "generalizePath" to "removeGroupFromURL", so let's drop the "generalize"). https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode98 httpproxy.py:98: start_include = 0 add a brief comment re: what the code below is doing. E.g. // remove all matched groups from the matched URL. // e.g. if our pattern is "(.*\.)?foo.com/bar.*(qux=1).*" // then "abc.foo.com/bart?qux=1&z" --> "foo.com/bart?&z" https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode105 httpproxy.py:105: logging.info('doing replacement in %s mode: %s -> %s', s/doing replacement/replaced/ https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode112 httpproxy.py:112: more_undesirable_keys = [undesirable_key] append to list instead of "= [value]"? https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode314 httpproxy.py:314: assert isinstance(predicate_args, list) replace assert's with "if .. raise ValueError"'s. http://stackoverflow.com/questions/944592/#answer-945135 b/c rules is user input, so bad input isn't an "internal error: this should never happen" case. https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode322 httpproxy.py:322: if re.search('\([^\)]?\(', url): ah, okay, you're checking for nested groups. cool, but -- and this is admittedly pedantic of me, so apologies but I've been bitten by this before! -- this won't work if the url has parenthesis. e.g. if I query gws w/ "a(b)c", the url is: https://www.google.com/?#q=a(b)c and, if I wanted to remove the "(b)", my pred_args would be: ["foo.com/a\(b(c)d\)e"] which you'd incorrectly reject. In general it's a PITA :/ another option is to catch this in get_archived_http_request, by checking the span start/stop indices for overlaps. https://codereview.appspot.com/114560043/diff/580001/json_rules.txt File json_rules.txt (right): https://codereview.appspot.com/114560043/diff/580001/json_rules.txt#newcode1 json_rules.txt:1: [ wrap file to 80 lines max width, if possible. https://codereview.appspot.com/114560043/diff/580001/json_rules.txt#newcode2 json_rules.txt:2: ["_comment", hmm, maybe also allow "# blah ..." <-- string must start with "\s*#" as shorthand for ["_comment", ...] let's support both; the latter is useful for commenting out rules, e.g.: ["_comment", "I disabled this!", ["urlMatches", ...], ...] https://codereview.appspot.com/114560043/diff/580001/json_rules.txt#newcode13 json_rules.txt:13: ["_comment", "Example rules for google search, bing search, and google search on an idevice."], add whitespace to group file into gws/bing/maps sections, e.g.: ["_comment", "google search"], ... ["_comment", "bing search"], ... https://codereview.appspot.com/114560043/diff/580001/json_rules.txt#newcode16 json_rules.txt:16: ["urlMatches", ["gg\\.google\\.com/csi.*jsv(.*)"], allow single pred_arg string, e.g.: ["urlMatches", "gg\\.google\\.com/csi.*jsv(.*)", ... ? I'm ambivalent about this, so please push back if you think it's a bad idea (e.g. could be more confusing and/or error prone than forcing the []'s). https://codereview.appspot.com/114560043/diff/580001/json_rules.txt#newcode35 json_rules.txt:35: ["urlMatches", [".*/maps/preview/log204"], add TODO comment re: needs host name. https://codereview.appspot.com/114560043/diff/580001/json_rules.txt#newcode41 json_rules.txt:41: ["urlMatches", [".*apple\\..*/.*"], Nit: use "[^/]*apple", to avoid matching "google.com/?q=apple.x/y" https://codereview.appspot.com/114560043/diff/580001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/580001/replay.py#newcode139 replay.py:139: assert isinstance(rule, list) as noted elsewhere, replace assert w/ raise. once nice thing is you have "i" as the line number, so it'll be easy to print that in the exception. https://codereview.appspot.com/114560043/diff/580001/replay.py#newcode143 replay.py:143: assert (len(rule) == 4) or (len(rule) == 3) perhaps overly clever, but can use 3 <= len(rule) < 4 https://codereview.appspot.com/114560043/diff/580001/replay.py#newcode557 replay.py:557: harness_group.add_option('--json_rules', default='json_rules.txt', s/.txt/.json/ I guess that implies renaming your commit file, too... :/ Sign in to reply to this message.
https://codereview.appspot.com/114560043/diff/580001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/580001/httparchive.py#newcode467 httparchive.py:467: more_undesirable_keys=None): On 2014/08/12 16:01:38, wrightt wrote: > rename to "additional_undesirable_keys"? (to avoid confusion re: whether > "more" is "more/less" degree as opposed to quantity) Done. https://codereview.appspot.com/114560043/diff/580001/httparchive.py#newcode479 httparchive.py:479: more_undesirable_keys: A dictionary of headers to exclude. On 2014/08/12 16:01:38, wrightt wrote: > s/dictionary/list/ Done. https://codereview.appspot.com/114560043/diff/580001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode267 httpclient.py:267: if ':' in request.host: I used the wrong copy of httpclient I dont mean to make these changes. The only changes in httpclient are supposed to be related to mutate_response On 2014/08/12 16:01:38, wrightt wrote: > Nit: simplify code & use split(.., 1) to prevent bogus "foo.com:123:456" input: > > host, port = request.host, None > if ':' in host: > parts = host.split(':', 1) > host, port = parts[0], int(parts[1]) https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode275 httpclient.py:275: host_ip = self._real_dns_lookup(truehost) On 2014/08/12 16:01:38, wrightt wrote: > add comment re: why we're looking up the host_ip. > > If the host name usually works, would it make sense to do the retry loop once w/ > the host name then switch to the host_ip on retry. We don't need to do this for > this CL, but I think it's worth commenting in the code. Acknowledged. https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode282 httpclient.py:282: if request.is_ssl: On 2014/08/12 16:01:38, wrightt wrote: > connection_class = (DetailedHTTPSConnection if request.is_ssl else > DetailedHTTPConnection) > connection = connection_cls(host_ip, port) Acknowledged. https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode443 httpclient.py:443: def mutate_response(request, response, callback_paths): On 2014/08/12 16:01:38, wrightt wrote: > add a minimal doc, e.g.: > """Modifies archived body's callback_id's with request's param.""" Done. https://codereview.appspot.com/114560043/diff/580001/httpclient.py#newcode495 httpclient.py:495: (predicate, predicate_args, action), action_args = rule[:3], rule[3:] On 2014/08/12 16:01:38, wrightt wrote: > in this specific case, don't need action_args, so can do: > predicate, predicate_args, action = rule[:3] Done. https://codereview.appspot.com/114560043/diff/580001/httpproxy.py File httpproxy.py (right): https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode95 httpproxy.py:95: for path in self.server.paths_to_generalize: On 2014/08/12 16:01:38, wrightt wrote: > rename "paths_to_generalize" to "paths_to_remove_groups_from" (ugh) or > "paths_to_edit"? > > (b/c we renamed the rule from "generalizePath" to "removeGroupFromURL", so let's > drop the "generalize"). Done. https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode98 httpproxy.py:98: start_include = 0 On 2014/08/12 16:01:38, wrightt wrote: > add a brief comment re: what the code below is doing. > E.g. > // remove all matched groups from the matched URL. > // e.g. if our pattern is "(.*\.)?foo.com/bar.*(qux=1).*" > // then "abc.foo.com/bart?qux=1&z" --> "foo.com/bart?&z" > Done. https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode105 httpproxy.py:105: logging.info('doing replacement in %s mode: %s -> %s', On 2014/08/12 16:01:38, wrightt wrote: > s/doing replacement/replaced/ Done. https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode112 httpproxy.py:112: more_undesirable_keys = [undesirable_key] On 2014/08/12 16:01:38, wrightt wrote: > append to list instead of "= [value]"? Done. https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode314 httpproxy.py:314: assert isinstance(predicate_args, list) On 2014/08/12 16:01:38, wrightt wrote: > replace assert's with "if .. raise ValueError"'s. > http://stackoverflow.com/questions/944592/#answer-945135 > > b/c rules is user input, so bad input isn't an "internal error: this should > never happen" case. Done. https://codereview.appspot.com/114560043/diff/580001/httpproxy.py#newcode322 httpproxy.py:322: if re.search('\([^\)]?\(', url): Parenthesis in the url have to be escaped so they aren't identified as groups, so I will just include that in the brackets. On 2014/08/12 16:01:38, wrightt wrote: > ah, okay, you're checking for nested groups. > > cool, but -- and this is admittedly pedantic of me, so apologies but I've been > bitten by this before! -- this won't work if the url has parenthesis. e.g. if > I query gws w/ "a(b)c", the url is: > https://www.google.com/#q=a(b)c > and, if I wanted to remove the "(b)", my pred_args would be: > ["foo.com/a\(b(c)d\)e"] > which you'd incorrectly reject. In general it's a PITA :/ > > another option is to catch this in get_archived_http_request, by checking the > span start/stop indices for overlaps. https://codereview.appspot.com/114560043/diff/580001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/580001/replay.py#newcode139 replay.py:139: assert isinstance(rule, list) On 2014/08/12 16:01:39, wrightt wrote: > as noted elsewhere, replace assert w/ raise. once nice thing is you have "i" as > the line number, so it'll be easy to print that in the exception. Done. https://codereview.appspot.com/114560043/diff/580001/replay.py#newcode143 replay.py:143: assert (len(rule) == 4) or (len(rule) == 3) On 2014/08/12 16:01:39, wrightt wrote: > perhaps overly clever, but can use 3 <= len(rule) < 4 Done. https://codereview.appspot.com/114560043/diff/580001/replay.py#newcode557 replay.py:557: harness_group.add_option('--json_rules', default='json_rules.txt', On 2014/08/12 16:01:39, wrightt wrote: > s/.txt/.json/ > > I guess that implies renaming your commit file, too... :/ Done. Sign in to reply to this message.
I removed all the example rules in this review. I will do a follow up review that has example rules for a certain website. Sign in to reply to this message.
https://codereview.appspot.com/114560043/diff/620001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/620001/httparchive.py#newcode488 httparchive.py:488: self.trimmed_headers = self._TrimHeaders(headers, additional_undesirable_keys) wrap 80 chars. https://codereview.appspot.com/114560043/diff/620001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode135 replay.py:135: with open(options.json_rules, 'r') as json_file: As discussed in today's VC, support multiple files: There are many options re: the args format, but how about we simply do: * if contains ',' then split by commas, e.g.: 'a.json,b.json' --> ['a.json', 'b.json'] # ignore escaping issues * if os.path.isdir(...) then os.listdir(...) the endswith '.json', e.g.: '/foo' --> ['/foo/a.json', '/foo/b.json', ...] # load '/foo/*.json' https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode137 replay.py:137: comments = [] nit: instead of pop'ing the comments, do: json_data = json.load(json_file) and only append the non-comments to json_rules. https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode139 replay.py:139: if not isinstance(rule, list): instead of [["#comment"], ...] do ["#comment", ...] https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode141 replay.py:141: if (rule[0] == '_comment') | (rule[0][0] == '#'): change "|" to "or" https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode141 replay.py:141: if (rule[0] == '_comment') | (rule[0][0] == '#'): want "\s*#", e.g. allow " # blah". https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode144 replay.py:144: assert 3 <= len(rule) <= 4 change assert to if ... raise https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode560 replay.py:560: help='Path of file containing json rules to modify urls.') update help w/ above multi-file support. Sign in to reply to this message.
I am having a little trouble processing this change (more sleep would help). It feels wrong to add "path_for_matching" to the archive. It seems like that should be done without changing the archive. This also looks like it adds a fair amount of work for every request. Maybe I can have some more ideas when I look at it tomorrow, however, I know Mary is out of time. https://codereview.appspot.com/114560043/diff/640001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode466 httparchive.py:466: is_ssl=False, path_for_matching=None, path_for_matching -> repr_path ? https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode467 httparchive.py:467: additional_undesirable_keys=None): additional_undesirable_keys -> exclude_headers https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode609 httparchive.py:609: Add "Args:" section to the doc string with additional_undesirable_keys. https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode613 httparchive.py:613: - accept-charset, accept-language, referer: vary between clients. Add cache-control back. https://codereview.appspot.com/114560043/diff/640001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode467 httpclient.py:467: def mutate_response(request, response, callback_paths): mutate_response -> modify_response https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode474 httpclient.py:474: newkey = request.full_path.rsplit('callback=_xdc_\._', 1)[1] How about making this a general rule? https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode474 httpclient.py:474: newkey = request.full_path.rsplit('callback=_xdc_\._', 1)[1] newkey -> new_key oldkey -> old_key https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode475 httpclient.py:475: resp_text = response.get_response_as_text() resp_text -> text new_resp_text -> new_text alternatively, response_text new_response_text "resp" (and most abbreviations) make me think too much. https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode520 httpclient.py:520: raise ValueError('Invalid %s type for %s should be %s instead' This means that a bad rule would kill the server? https://codereview.appspot.com/114560043/diff/640001/httpproxy.py File httpproxy.py (right): https://codereview.appspot.com/114560043/diff/640001/httpproxy.py#newcode93 httpproxy.py:93: path_for_matching = full_path Why pass path_for_matching instead of simple changing the full_path? https://codereview.appspot.com/114560043/diff/640001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/640001/replay.py#newcode144 replay.py:144: assert 3 <= len(rule) <= 4 assert len(rule) in (3, 4) https://codereview.appspot.com/114560043/diff/640001/replay.py#newcode145 replay.py:145: [json_rules.pop(i) for i in comments[::-1]] List comprehensions are not typically used from side-effects. Sign in to reply to this message.
> It feels wrong to add "path_for_matching" to the archive. It seems like > that should be done without changing the archive. Mary and I discussed this; it dates back to Ryan's impl in: cs/tactile/httparchive.py:450 I'll create a doc w/ the pros/cons of various design options. https://codereview.appspot.com/114560043/diff/640001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode474 httpclient.py:474: newkey = request.full_path.rsplit('callback=_xdc_\._', 1)[1] On 2014/08/15 00:05:40, slamm1 wrote: > How about making this a general rule? We sketched a generic version of this but decided to defer the impl, to make sure we get the initial impl done in time for Mary's internship. I'll writeup the design. https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode520 httpclient.py:520: raise ValueError('Invalid %s type for %s should be %s instead' On 2014/08/15 00:05:40, slamm1 wrote: > This means that a bad rule would kill the server? Yes, but we want it to die if the rules file is invalid. (seems better than ignoring bad rules, 'though I guess we could warn & ignore 'em...) https://codereview.appspot.com/114560043/diff/640001/httpproxy.py File httpproxy.py (right): https://codereview.appspot.com/114560043/diff/640001/httpproxy.py#newcode93 httpproxy.py:93: path_for_matching = full_path On 2014/08/15 00:05:40, slamm1 wrote: > Why pass path_for_matching instead of simple changing the full_path? Our thought was that the archive should list the fetched URLs, not the trimmed ones. Also see our general discussion re: path_for_matching design. Sign in to reply to this message.
I generalized replace callbacks. So in the rules you can specify which group to replace in the action_args using a regex and you can replace what to replace it with using a regex in the predicate args https://codereview.appspot.com/114560043/diff/640001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode467 httpclient.py:467: def mutate_response(request, response, callback_paths): On 2014/08/15 00:05:40, slamm1 wrote: > mutate_response -> modify_response Done. https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode474 httpclient.py:474: newkey = request.full_path.rsplit('callback=_xdc_\._', 1)[1] I made it more general. So now you specify the path in MatchUrl with a group identifying the new_key, and as an action arg you can specify a regex with a group identifying the old_key in the response. On 2014/08/15 00:05:40, slamm1 wrote: > How about making this a general rule? https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode474 httpclient.py:474: newkey = request.full_path.rsplit('callback=_xdc_\._', 1)[1] On 2014/08/15 00:05:40, slamm1 wrote: > newkey -> new_key > oldkey -> old_key Done. https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode475 httpclient.py:475: resp_text = response.get_response_as_text() On 2014/08/15 00:05:40, slamm1 wrote: > resp_text -> text > new_resp_text -> new_text > > alternatively, > > response_text > new_response_text > > "resp" (and most abbreviations) make me think too much. > Done. https://codereview.appspot.com/114560043/diff/640001/httpclient.py#newcode520 httpclient.py:520: raise ValueError('Invalid %s type for %s should be %s instead' On 2014/08/15 14:38:55, wrightt wrote: > On 2014/08/15 00:05:40, slamm1 wrote: > > This means that a bad rule would kill the server? > > Yes, but we want it to die if the rules file is invalid. (seems better than > ignoring bad rules, 'though I guess we could warn & ignore 'em...) Done. https://codereview.appspot.com/114560043/diff/640001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/640001/replay.py#newcode144 replay.py:144: assert 3 <= len(rule) <= 4 On 2014/08/15 00:05:40, slamm1 wrote: > assert len(rule) in (3, 4) Done. https://codereview.appspot.com/114560043/diff/640001/replay.py#newcode145 replay.py:145: [json_rules.pop(i) for i in comments[::-1]] On 2014/08/15 00:05:40, slamm1 wrote: > List comprehensions are not typically used from side-effects. Done. Sign in to reply to this message.
I generalized replace callbacks. So in the rules you can specify which group to replace in the action_args using a regex and you can replace what to replace it with using a regex in the predicate args Sign in to reply to this message.
LGTM, only a couple notes below. https://codereview.appspot.com/114560043/diff/620001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode135 replay.py:135: with open(options.json_rules, 'r') as json_file: add TODO(wrightt): re: above multi-file support. https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode137 replay.py:137: comments = [] please impl above: > instead of pop'ing the comments, do: > json_data = json.load(json_file) > and only append the non-comments to json_rules. to address Steve's feedback below re: "don't pop in list_comprehension]" https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode139 replay.py:139: if not isinstance(rule, list): please impl to allow " # comment" instead of ["#comment"] I.e., drop the list wrapper, it's cleaner. https://codereview.appspot.com/114560043/diff/620001/replay.py#newcode141 replay.py:141: if (rule[0] == '_comment') | (rule[0][0] == '#'): please fix "|" to "or", b/c it's a bug as-is. https://codereview.appspot.com/114560043/diff/640001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/640001/replay.py#newcode135 replay.py:135: with open(options.json_rules, 'r') as json_file: add TODO(wrightt): re: support comma-separated file names & dirs https://codereview.appspot.com/114560043/diff/680001/httpclient.py File httpclient.py (right): https://codereview.appspot.com/114560043/diff/680001/httpclient.py#newcode471 httpclient.py:471: logging.info('loooooooooooook ' +request.full_path) :) Sign in to reply to this message.
https://codereview.appspot.com/114560043/diff/640001/httparchive.py File httparchive.py (right): https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode466 httparchive.py:466: is_ssl=False, path_for_matching=None, On 2014/08/15 00:05:40, slamm1 wrote: > path_for_matching -> repr_path ? Done. https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode467 httparchive.py:467: additional_undesirable_keys=None): On 2014/08/15 00:05:40, slamm1 wrote: > additional_undesirable_keys -> exclude_headers Done. https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode609 httparchive.py:609: On 2014/08/15 00:05:40, slamm1 wrote: > Add "Args:" section to the doc string with additional_undesirable_keys. Done. https://codereview.appspot.com/114560043/diff/640001/httparchive.py#newcode613 httparchive.py:613: - accept-charset, accept-language, referer: vary between clients. On 2014/08/15 00:05:40, slamm1 wrote: > Add cache-control back. Done. https://codereview.appspot.com/114560043/diff/640001/replay.py File replay.py (right): https://codereview.appspot.com/114560043/diff/640001/replay.py#newcode135 replay.py:135: with open(options.json_rules, 'r') as json_file: On 2014/08/15 19:44:39, wrightt wrote: > add TODO(wrightt): re: support comma-separated file names & dirs Done. Sign in to reply to this message.
Steve, ping? I somehow thought we pushed this months ago... :( I can rebase to the latest master & submit a new CR. More generally, design feedback? I can maybe do a little cleanup this quarter, but anything significant would have to wait until next quarter. Sign in to reply to this message. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
