
Google is crawling my URLs with non-existent garbage parameters appended to the end of the canonical URLs.

So, how can I keep only the first URL parameter and remove any remaining parameters using .htaccess?

i.e. rewrite these:

    https://www.example.com/index.php?p1=p1&p2=p2
    https://www.example.com/index.php?p1=p1&p2=p2&p3=p3
    https://www.example.com/index.php?p1=p1&p2=p2&p3=p3&p4=p4

to this canonical one:

https://www.example.com/index.php?p1=p1 

Thanks

  • Is p1 literally the parameter named p1, or is this just representative of the first URL parameter - whatever that might be? Commented Jun 1 at 12:23

1 Answer

I take the liberty of assuming you do not want to blindly keep the first key=value pair regardless of the key name. Instead, you want to specifically extract and keep the parameter p1 (if present) and remove all other parameters.

Using mod_rewrite works for a single allowed parameter

As mod_alias cannot manipulate the query string, you need mod_rewrite for this. The supplemental documentation on Redirecting and Remapping with mod_rewrite has similar examples for rewriting query strings.

    RewriteEngine On

    RewriteCond %{QUERY_STRING} (^|&)(?!p1=)([^&]+)(&|$)
    RewriteCond %{QUERY_STRING} (^|&)p1=([^&]+)(&|$)
    RewriteRule ^index\.php$ /index.php?p1=%2 [R=301,L]

Breakdown

RewriteCond conditions:

  • The %{QUERY_STRING} contains the query component of the URL (RFC 3986, 3.4).
  • The first condition ensures there are parameters other than p1.
    • This is crucial, as it prevents an infinite redirect loop.
  • The second condition searches for parameter p1 and captures its value.
    • The first capture group (^|&) ensures we match either the start of the query string or the & that separates parameters, so the rule finds p1 whether it is the first, middle, or last parameter.
    • The p1=([^&]+) part matches the key p1, and the second capture group ([^&]+) captures its value, which is available in the RewriteRule as %2.
    • The third capture group (&|$) ensures we stop matching at the next parameter or the end of the string, avoiding over-matching.

RewriteRule:

  • As only p1=%2 is used in the substitution /index.php?p1=%2, all other parameters are removed. After the redirect the query string contains only p1, so the first condition no longer matches and the rule does not fire again.
  • The flags [R=301,L] make this a 301 Moved Permanently (RFC 7231, 6.4.2) redirect and stop processing further rules. Keep in mind that browsers cache 301 responses.
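To make the loop-prevention argument concrete, here is a minimal Python sketch that replays the two conditions against a few query strings. It assumes Python's re module treats these particular patterns the same way Apache's PCRE engine does, and the function name single_param_redirect is made up for this illustration; it is not part of Apache.

```python
import re

# Patterns copied from the two RewriteCond lines of the single-parameter rule.
HAS_OTHER_PARAM = re.compile(r"(^|&)(?!p1=)([^&]+)(&|$)")
P1_VALUE = re.compile(r"(^|&)p1=([^&]+)(&|$)")


def single_param_redirect(query_string):
    """Return the query string the rule would redirect to, or None if it would not fire."""
    # First condition: some parameter other than p1 must be present.
    if not HAS_OTHER_PARAM.search(query_string):
        return None
    # Second condition: p1 must be present; group 2 plays the role of %2.
    match = P1_VALUE.search(query_string)
    if match is None:
        return None
    return "p1=" + match.group(2)


for qs in ["p1=p1&p2=p2", "x=1&p1=v&y=2", "p1=p1", "p2=p2"]:
    print(qs, "->", single_param_redirect(qs))
```

The already-canonical query string p1=p1 yields None, which is why the redirect does not loop. A query string without p1 also yields None, because the second condition fails.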

It gets complicated with multiple allowed parameters

Using mod_rewrite works well for removing all parameters except one. However, if you need to keep two or more specific parameters and discard the rest, it quickly gets messy. You must account for all possible permutations of non-empty subsets of the allowed parameters.

Additionally, the rewrite rules must be carefully ordered — starting with cases where more allowed parameters are present — to ensure correct matching.

The following example illustrates this with two allowed parameters (p1, p2), each of which is optional, but any other parameters should be removed if present.

    RewriteEngine On

    # Remove extra parameters when...

    # ...p1 before p2
    RewriteCond %{QUERY_STRING} (^|&)(?!p1=|p2=)([^&]+)(&|$)
    RewriteCond %{QUERY_STRING} (?:^|&)p1=([^&]+).*?(?:^|&)p2=([^&]+)
    RewriteRule ^index\.php$ /index.php?p1=%1&p2=%2 [R=301,L]

    # ...p2 before p1
    RewriteCond %{QUERY_STRING} (^|&)(?!p1=|p2=)([^&]+)(&|$)
    RewriteCond %{QUERY_STRING} (?:^|&)p2=([^&]+).*?(?:^|&)p1=([^&]+)
    RewriteRule ^index\.php$ /index.php?p1=%2&p2=%1 [R=301,L]

    # ...only p1 present
    RewriteCond %{QUERY_STRING} (^|&)(?!p1=|p2=)([^&]+)(&|$)
    RewriteCond %{QUERY_STRING} (?:^|&)p1=([^&]+)
    RewriteRule ^index\.php$ /index.php?p1=%1 [R=301,L]

    # ...only p2 present
    RewriteCond %{QUERY_STRING} (^|&)(?!p1=|p2=)([^&]+)(&|$)
    RewriteCond %{QUERY_STRING} (?:^|&)p2=([^&]+)
    RewriteRule ^index\.php$ /index.php?p2=%1 [R=301,L]

    # (optional) ...neither p1 nor p2 is present, strip all parameters
    # The trailing ? discards the original query string; without it mod_rewrite
    # passes the query string through unchanged and the redirect loops.
    # (On Apache 2.4+ the QSD flag does the same.)
    RewriteCond %{QUERY_STRING} (^|&)(?!p1=|p2=)([^&]+)(&|$)
    RewriteRule ^index\.php$ /index.php? [R=301,L]
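The effect of the rule ordering can be checked with a small Python simulation. This is only a sketch: it assumes Python's re module behaves like Apache's PCRE for these patterns, and the names RULES and two_param_redirect are invented for the illustration.

```python
import re

# Shared first condition: some parameter other than p1 or p2 is present.
OTHER_PARAM = re.compile(r"(^|&)(?!p1=|p2=)([^&]+)(&|$)")

# (second condition, substitution) pairs in the same order as the rules above.
RULES = [
    (re.compile(r"(?:^|&)p1=([^&]+).*?(?:^|&)p2=([^&]+)"), r"p1=\1&p2=\2"),
    (re.compile(r"(?:^|&)p2=([^&]+).*?(?:^|&)p1=([^&]+)"), r"p1=\2&p2=\1"),
    (re.compile(r"(?:^|&)p1=([^&]+)"), r"p1=\1"),
    (re.compile(r"(?:^|&)p2=([^&]+)"), r"p2=\1"),
]


def two_param_redirect(query_string):
    """Return the redirected query string, or None if no rule would fire."""
    if not OTHER_PARAM.search(query_string):
        return None
    for pattern, template in RULES:
        match = pattern.search(query_string)
        if match:
            # The first matching rule wins, mirroring the [L] flag.
            return match.expand(template)
    # Neither allowed parameter is present: strip everything.
    return ""


for qs in ["p2=b&x=1&p1=a", "p1=a&junk=1", "junk=1", "p1=a&p2=b"]:
    print(repr(qs), "->", repr(two_param_redirect(qs)))
```

If the single-parameter rules came first, p2=b&x=1&p1=a would be reduced to p1=a and p2 would be lost, which is why the two-parameter cases have to be tried first.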

You can see how quickly the complexity grows as the number of allowed parameters increases:

| n | Allowed parameters | Subsets | Permutations per subset | Total permutations |
|---|---|---|---|---|
| 2 | p1, p2 | {p1, p2}; {p1}; {p2} | 2! = 2 * 1 = 2; 1! = 1; 1! = 1 | 4 |
| 3 | p1, p2, p3 | {p1, p2, p3}; {p1, p2}; {p1, p3}; {p2, p3}; {p1}; {p2}; {p3} | 3! = 3 * 2 * 1 = 6; 2! = 2; 2! = 2; 2! = 2; 1! = 1; 1! = 1; 1! = 1 | 15 |
| 4 | p1, p2, p3, p4 | {p1, p2, p3, p4}; ... | 4! = 4 * 3 * 2 * 1 = 24; ... | 64 |
| 5 | p1, ..., p5 | | 5! = 120; ... | 325 |
| 6 | p1, ..., p6 | | 6! = 720; ... | 1956 |
| 7 | p1, ..., p7 | | 7! = 5040; ... | 13699 |
| 8 | p1, ..., p8 | | 8! = 40320; ... | 109600 |

...as the total number of permutations across all non-empty subsets of an n-element set comes from this generic formula:

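$$T(n) = \sum_{k=1}^{n} \binom{n}{k}\, k! = \sum_{k=1}^{n} \frac{n!}{(n-k)!}$$

where k is the number of allowed parameters present in the URL, \binom{n}{k} counts the subsets of that size, and k! counts the orderings of each subset.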
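As a quick check of the counts, the following Python sketch computes the total in two ways: by summing n!/(n-k)! over every subset size k, and by brute-force enumeration of the ordered arrangements. The function names are chosen for this illustration only.

```python
from itertools import permutations
from math import factorial


def total_permutations(n):
    """Sum n!/(n-k)! over all non-empty subset sizes k = 1..n."""
    return sum(factorial(n) // factorial(n - k) for k in range(1, n + 1))


def count_by_enumeration(n):
    """Count every ordered arrangement of every non-empty subset of p1..pn."""
    names = [f"p{i}" for i in range(1, n + 1)]
    return sum(1 for k in range(1, n + 1) for _ in permutations(names, k))


for n in range(2, 9):
    print(n, total_permutations(n))
```

For n = 2, 3 and 4 both functions return 4, 15 and 64. The enumeration is only practical for small n, which is itself a hint at how fast the rule count grows.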

Because it is neither practical nor efficient to maintain this many sets of mod_rewrite rules, I would suggest solving this in PHP instead.

Using PHP

As this /index.php is a PHP script, you could also normalize your parameters at the beginning of the script. This is more flexible as it does not have the limitations of mod_rewrite with multiple allowed parameters.

    <?php
    $allowed_params = ['p1', 'p2']; // you can add more
    $current_params = $_GET;

    $filtered_params = array_intersect_key(
        $current_params,
        array_flip($allowed_params)
    );

    // Optional: sort filtered params by allowed_params order
    uksort($filtered_params, function($a, $b) use ($allowed_params) {
        return array_search($a, $allowed_params) <=> array_search($b, $allowed_params);
    });

    $normalized_query = http_build_query($filtered_params);
    $current_query = $_SERVER['QUERY_STRING'];
    $normalized_url = $_SERVER['PHP_SELF'] . ($normalized_query ? '?' . $normalized_query : '');

    if ($current_query !== $normalized_query) {
        header("Location: $normalized_url", true, 301);
        exit;
    }

    // Continue with normal logic...
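To see what the normalization does to a few inputs without deploying anything, the same filter-sort-compare idea can be sketched in Python. This is an illustration, not a port of the PHP: ALLOWED_PARAMS, normalized_query and canonical_redirect are hypothetical names, and unlike PHP's $_GET, Python's parse_qsl does not interpret array syntax such as p1[]=x.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical allow-list, in the order the canonical URL should use.
ALLOWED_PARAMS = ["p1", "p2"]


def normalized_query(query_string):
    """Keep only allowed parameters, ordered as in ALLOWED_PARAMS."""
    # For repeated keys the last value wins, as with plain keys in PHP's $_GET.
    params = dict(parse_qsl(query_string, keep_blank_values=True))
    kept = [(name, params[name]) for name in ALLOWED_PARAMS if name in params]
    return urlencode(kept)


def canonical_redirect(query_string):
    """Return the query string to redirect to, or None if already canonical."""
    canonical = normalized_query(query_string)
    return None if canonical == query_string else canonical


for qs in ["p2=b&junk=1&p1=a", "p1=a&p2=b", "junk=1"]:
    print(repr(qs), "->", repr(canonical_redirect(qs)))
```

Comparing the incoming query string with its normalized form is what prevents a redirect loop: once the URL is canonical, the two are equal and no redirect is issued.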
  • "Apache avoids infinite redirects by not redirecting to the exact same URL again unless you force it" - Not sure what you mean by that? The first rule above is a redirect loop since the RewriteCond pattern (^|&)p1=([^&]+)(&|$) also matches the redirected query string p1=p1 (and redirects to the same). Commented Jun 1 at 12:38
  • "you can add a guard against looping by checking the query string has no other parameters" - The logic seems to be wrong here. And this isn't a "guard", this is mandatory. You need to check that the URL does contain other parameters (since this is a canonical redirect to remove the other params). The 2nd condition in your 2nd rule is checking that p1 is not present (at the start of the query string), which conflicts with the 1st condition (so the rule does nothing - most of the time). Commented Jun 1 at 12:39
  • Since the "garbage URL parameters at the end" then the easiest solution would perhaps be to just check for a trailing & after the first URL param. eg. ^p1=([^&]+)&. NB: Since this is .htaccess you will need to remove the slash prefix from the RewriteRule pattern. In your 3rd rule there is no %5 backreference (on Apache). The % backreferences only come from the last matched CondPattern. You would need to "copy" the first backreference to the second condition. (I would avoid capturing groups that you are not interested in and affect the numbering of the backreferences.) Commented Jun 1 at 12:48
  • @MrWhite Thanks for your input. I think I have corrected my mistakes now. Commented Jun 1 at 21:03
