[PostgreSQL] Enhance grok pattern #10412
Conversation
LGTM! 🚀
Please resolve merge conflicts!
/test

/test
```yaml
        ignore_missing: true
        patterns:
          - '^(\[%{NUMBER:process.pid:long}(-%{BASE16FLOAT:postgresql.log.session_line_number:long})?\] ((\[%{USERNAME:user.name}\]@\[%{POSTGRESQL_DB_NAME:postgresql.log.database}\]|%{USERNAME:user.name}@%{POSTGRESQL_DB_NAME:postgresql.log.database}) )?)?%{WORD:log.level}: (?:%{POSTGRESQL_ERROR:postgresql.log.sql_state_code}|%{SPACE})(duration: %{NUMBER:temp.duration:float} ms %{POSTGRESQL_QUERY_STEP}: %{GREEDYDATA:postgresql.log.query}|: %{GREEDYDATA:message}|%{GREEDYDATA:message})'
          - '^(\[%{NUMBER:process.pid:long}(-%{BASE16FLOAT:postgresql.log.session_line_number:long})?\]:? ?)?(\[%{NUMBER:postgresql.log.session_line_number:long}(-%{BASE16FLOAT:postgresql.log.sequence_number:long})?\] )?((\[%{USERNAME:user.name}\]@\[%{POSTGRESQL_DB_NAME:postgresql.log.database}\]|%{USERNAME:user.name}@%{POSTGRESQL_DB_NAME:postgresql.log.database} )?)?((db=((\[%{DATA:postgresql.log.database}\])|%{DATA:postgresql.log.database})?,user=((\[%{DATA:user.name}\])|%{DATA:user.name})?,app=((\[%{DATA:postgresql.log.application_name}\])|%{DATA:postgresql.log.application_name})?,client=((\[%{DATA:postgresql.log.client_addr}\])|%{DATA:postgresql.log.client_addr})?)? ?)?%{WORD:log.level}: (?:%{POSTGRESQL_ERROR:postgresql.log.sql_state_code}|%{SPACE})(duration: %{NUMBER:temp.duration:float} ms %{POSTGRESQL_QUERY_STEP}: %{GREEDYDATA:postgresql.log.query}|: %{GREEDYDATA:message}|%{GREEDYDATA:message})'
```
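For illustration, two hypothetical log lines (values invented, not from the PR's test data) in the shapes the first and second pattern are meant to match, respectively:

```
[12345-1] postgres@mydb LOG:  duration: 2.357 ms  statement: SELECT 1;
[12345] db=mydb,user=postgres,app=psql,client=10.0.0.5 LOG:  connection authorized: user=postgres database=mydb
```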
Curious why we are not considering the timestamp. Any idea? As simple as this: `^%{TIMESTAMP_ISO8601:timestamp} %{TZ:timezone}`
It is already extracted in the default pipeline, and we then extract more details from the remaining raw message.
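For context, a rough sketch (illustrative only, not the package's exact processor) of how a leading grok in the default pipeline can peel the timestamp off before the patterns above run:

```yaml
- grok:
    field: message
    patterns:
      # _temp_.timestamp and event.timezone are the fields the later timezone
      # script reads; the TZ pattern definition here is an assumption.
      - '^%{TIMESTAMP_ISO8601:_temp_.timestamp} %{TZ:event.timezone} %{GREEDYDATA:message}'
    pattern_definitions:
      TZ: '[A-Z]{2,5}|[+-][0-9]{2}(:[0-9]{2})?'
    ignore_failure: true
```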
Co-authored-by: subham sarkar <sarkar.subhams2@gmail.com>
AFAIK, we do not pack support for custom log formats in our integration packages, right?
```yaml
# newer versions go on top
- version: "1.24.0"
  changes:
    - description: Fix log datastream grok pattern to support standard logs.
```
Along with standard, aren't you also handling the custom formats like the one mentioned in the SDH?
Yes we are!
Yes, we do not. But there are 3 new fields we have seen in the client logs, appearing in two different SDHs, that seem crucial. We also saw new timezones beyond the defaults like UTC/IST, and we have added support for them (CEST, EST, +02:00 and +02). Refer. The changes we have made do not affect existing behavior; you can refer to the generated pipeline tests.
For defaults, UTC is fine. But IST or any other shouldn't be considered a default. Also, if I see correctly, there are more such time zones. See how the `date` processor considers the timezone: https://github.com/elastic/elasticsearch/blob/2871b94603655ff3b6120788728bfbf7c009d27f/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/DateProcessor.java#L96

So, I was playing with the script, made some changes, and noticed that some results in the pipeline tests are incorrect. I'll share the patch with you here; please confirm:

```diff
diff --git a/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/default.yml b/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/default.yml
index 523134dfe9..23447adaad 100644
--- a/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/default.yml
+++ b/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/default.yml
@@ -33,43 +33,53 @@ processors:
       UTC as before.
     if: ctx._temp_?.timestamp != null
     source: |-
-      String get_timezone(def ctx) {
+      String get_timezone(def ctx) {
         if (ctx.event?.timezone != null) {
-
-          // if timezone is in format of CEST,EST etc.
-          if (ctx.event.timezone.length() <= 4) {
-            // all time zone abbreviations need to be uppercase
-            return ctx.event.timezone.toUpperCase();
+          String tz = ctx.event.timezone.trim().toUpperCase();
+
+          // Handle abbreviations (e.g., CEST, EST)
+          if (tz.length() <= 4) {
+            return tz;
           }
-          // if timezone is in format of +00:00, -00:00, +00 & -00
-          if(ctx.event?.timezone.contains('+') || ctx.event?.timezone.contains('-')) {
-            // if timezone is in format of +00 or -00
-            if (ctx.event.timezone.length() == 3) {
-              return "UTC"+ctx.event.timezone+":00";
-            }
-            return "UTC"+ctx.event.timezone;
+          // Handle offset formats (+00:00, -00:00, +00, -00)
+          if (tz.startsWith('+') || tz.startsWith('-')) {
+            return tz.length() == 3 ? "UTC" + tz + ":00" : "UTC" + tz;
           }
-
-          return ctx.event.timezone;
+
+          return tz;
         }
-
+
         ctx.event.timezone = 'UTC';
         return 'UTC';
       }
-
+
       def event_timezone = get_timezone(ctx);
-      if (!(event_timezone.contains('+')) && !(event_timezone.contains('-')) && !(event_timezone.length() > 4)) {
-        // timezone abbreviation e.g. CEST need to be put inside the timestamp
-        SimpleDateFormat sdf = new SimpleDateFormat("z");
-        sdf.parse(event_timezone);
-        ctx._temp_.date_timezone = ZoneId.of(sdf.getTimeZone().getID(), ZoneId.SHORT_IDS).getId();
-        ctx?._temp_.timestamp = ctx?._temp_.timestamp + " " + event_timezone;
+      boolean isAbbreviation = !(event_timezone.contains('+') || event_timezone.contains('-') || event_timezone.length() > 4);
+
+      if (isAbbreviation) {
+        Map CUSTOM_ZONE_IDS = new HashMap(ZoneId.SHORT_IDS);
+        CUSTOM_ZONE_IDS.put("CEST", "Europe/Paris");
+        CUSTOM_ZONE_IDS.put("CET", "Europe/Paris");
+        CUSTOM_ZONE_IDS.put("BST", "Europe/London");
+        CUSTOM_ZONE_IDS.put("IST", "Asia/Kolkata");
+        CUSTOM_ZONE_IDS.put("EEST", "Europe/Athens");
+        CUSTOM_ZONE_IDS.put("EET", "Europe/Athens");
+        CUSTOM_ZONE_IDS.put("AEST", "Australia/Sydney");
+        CUSTOM_ZONE_IDS.put("AEDT", "Australia/Sydney");
+        CUSTOM_ZONE_IDS.put("ACST", "Australia/Adelaide");
+        CUSTOM_ZONE_IDS.put("ACDT", "Australia/Adelaide");
+        CUSTOM_ZONE_IDS.put("AWST", "Australia/Perth");
+
+        // Timezone abbreviation (e.g., CEST) needs to be put inside the timestamp
+        ZoneId zoneId = ZoneId.of(event_timezone, CUSTOM_ZONE_IDS);
+        ctx._temp_.date_timezone = zoneId.getId();
+        ctx?._temp_.timestamp += " " + event_timezone;
       } else {
-        // timezone is either abbreviation+-offset e.g. UTC+1 or long representation
-        // e.g. Europe/Athens needs to be put as a ZoneId and *not* inside the timestamp
+        // Timezone is either abbreviation+-offset (e.g., UTC+1) or long representation (e.g., Europe/Athens)
         ctx._temp_.date_timezone = event_timezone;
-      }
+      }
+
   - pipeline:
       name: '{{ IngestPipeline "pipeline-log" }}'
       if: ctx.separator != ',' && ctx.separator != ':'
@@ -86,8 +96,8 @@ processors:
       formats:
         - yyyy-MM-dd HH:mm:ss.SSS zz
         - yyyy-MM-dd HH:mm:ss zz
-        - yyyy-MM-dd HH:mm:ss.SSS
-        - yyyy-MM-dd HH:mm:ss
+        - yyyy-MM-dd HH:mm:ss.SSS
+        - yyyy-MM-dd HH:mm:ss
       on_failure:
         - set:
             field: error.message
diff --git a/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/pipeline-log.yml b/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/pipeline-log.yml
index 7e697a59cc..51693ff1e3 100644
--- a/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/pipeline-log.yml
+++ b/packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/pipeline-log.yml
@@ -16,11 +16,11 @@ processors:
   - kv:
       tag: kv_log_database_info
       field: _temp_.database_connection_str
-      target_field: _temp_.database_connection_obj
+      target_field: _temp_.database_connection_obj
       field_split: ','
       value_split: '='
       ignore_missing: true
-      strip_brackets: true
+      strip_brackets: true
       description: Splits PostgreSQL log info string into key-value pairs.
       ignore_failure: true
       on_failure:
@@ -32,16 +32,16 @@ processors:
   - rename:
       field: _temp_.database_connection_obj.app
       target_field: postgresql.log.application_name
-      ignore_missing: true
+      ignore_missing: true
   - rename:
       field: _temp_.database_connection_obj.client
       target_field: postgresql.log.client_addr
-      ignore_missing: true
+      ignore_missing: true
   - rename:
       field: _temp_.database_connection_obj.db
       target_field: postgresql.log.database
-      ignore_missing: true
+      ignore_missing: true
   - rename:
       field: _temp_.database_connection_obj.user
       target_field: user.name
-      ignore_missing: true
\ No newline at end of file
+      ignore_missing: true
\ No newline at end of file
```

This is a diff patch; you can apply it directly using git commands. Example failures: search for …

Also, I had to introduce a custom map because https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS only supports a few abbreviations, not all.
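A minimal, self-contained script processor (a sketch to illustrate the point above, not part of the patch; the `resolved_zone` field is invented) showing why the extended map is needed: `ZoneId.SHORT_IDS` has no CEST entry, so `ZoneId.of('CEST', ZoneId.SHORT_IDS)` throws until the alias is added:

```yaml
- script:
    lang: painless
    description: Sketch - resolve a timezone abbreviation via an extended alias map.
    source: |-
      // Copy the built-in aliases, then add the missing abbreviation.
      Map zones = new HashMap(ZoneId.SHORT_IDS);
      zones.put('CEST', 'Europe/Paris'); // assumed mapping, mirroring the patch
      ctx.resolved_zone = ZoneId.of('CEST', zones).getId(); // -> "Europe/Paris"
```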
Except for my previous comment, the rest of the changes look good.
@shmsr Thanks for the suggestion. Based on your provided changes I have implemented the pipeline tests. Here are the results of the current pipeline behavior vs. the suggested behavior; please observe the difference.

Current behaviour:

Suggested behavior:
```yaml
      ignore_missing: true
  - rename:
      field: _temp_.database_connection_obj.client
      target_field: postgresql.log.client_addr
```
Could this be converted to `ip` type? Also, would it make sense here to remove/clean up `_temp_` after you have finished renaming these fields?
> Could this be converted to `ip` type?
There are values like `local` for this field, and there are IP values as well. I have updated the pipeline so that IP values are set to the `related.ip` field and keyword values to the `related.host` field.
> Also, would it make sense here to remove/clean up `_temp_` after you have finished renaming these fields?
Yes, after renaming the fields there were still some fields in `_temp_` that needed cleanup, hence removed `_temp_` (a sketch follows below).
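A minimal sketch of that flow (assuming ECS's array fields `related.ip`/`related.hosts`; the exact processors in the PR may differ, and `_temp_.client_ip` is an invented scratch field):

```yaml
- convert:
    field: postgresql.log.client_addr
    target_field: _temp_.client_ip
    type: ip
    ignore_missing: true
    on_failure:
      # Not an IP (e.g. "local" or a hostname): record it as a host instead.
      - append:
          field: related.hosts
          value: '{{{postgresql.log.client_addr}}}'
          allow_duplicates: false
- append:
    field: related.ip
    value: '{{{_temp_.client_ip}}}'
    allow_duplicates: false
    if: ctx._temp_?.client_ip != null
# Drop the scratch space once all renames and copies are done.
- remove:
    field: _temp_
    ignore_missing: true
```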
💚 Build Succeeded
Didn't understand this.
Did you check this ^^? Previously, EST's TZ offset was shown as -05:00, but if you search, EST is actually -04:00. Also, let me take a look at the changes again; I have lost some context over the past 2 weeks. And sorry for missing the re-review; I didn't get the notification.
Yes @shmsr, I have checked your previous comment, and as shown in the comment above, both timestamps are logically the same if you look at the hour: with -04:00 the hour was 13:36, and with -05:00 it was 12:36. I have already made the changes per your suggestions, which you can see in the updated PR.
Thanks Harnish. But I am thinking of this: https://stackoverflow.com/a/68821105/5821408. Three-letter timezones are tricky; that's why four-letter timezones exist to clear things up. For example:

Same for IST; it could be Indian Standard Time or maybe even Israel Standard Time, and depending on what we choose to map, we will get the offset accordingly.

The change looks good. I don't think we can do much here. In case an SDH or clarification is ever required, we can point them to the assumed defaults. Approving this PR.
How does one go about adding to the list of CUSTOM_ZONE_IDS? Since this update we have been receiving: …

If I update the …
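For anyone else hitting this: the map is hard-coded in the script processor in `packages/postgresql/data_stream/log/elasticsearch/ingest_pipeline/default.yml`, so extending it means editing that source and rebuilding the package, roughly like this (the NZST/NZDT entries are hypothetical examples, not real additions):

```yaml
source: |-
  Map CUSTOM_ZONE_IDS = new HashMap(ZoneId.SHORT_IDS);
  CUSTOM_ZONE_IDS.put("CEST", "Europe/Paris");
  // ... existing entries ...
  CUSTOM_ZONE_IDS.put("NZST", "Pacific/Auckland"); // hypothetical addition
  CUSTOM_ZONE_IDS.put("NZDT", "Pacific/Auckland"); // hypothetical addition
```

Alternatively, re-parsing the timestamp in a `logs-postgresql.log@custom` ingest pipeline may work around abbreviations the shipped map does not cover.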
* fix Grok pattern Failure Co-authored-by: harnish-elastic <harnish.chavda@elastic.co> Co-authored-by: Harnish Chavda <118714680+harnish-elastic@users.noreply.github.com>




Description
PostgreSQL supports several methods for logging server messages, including stderr, csvlog, jsonlog, and syslog. On Windows, eventlog is also supported. Set the `log_destination` parameter to a comma-separated list of desired log destinations; the default is to log to stderr only. This parameter can only be set in the `postgresql.conf` file or on the server command line.

cf. https://www.postgresql.org/docs/current/runtime-config-logging.html
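For context, the custom `db=...,user=...,app=...,client=...` header handled by this PR typically comes from a user-defined `log_line_prefix`; an illustrative stderr configuration (example values, not taken from the SDH):

```
# postgresql.conf (illustrative)
log_destination = 'stderr'
logging_collector = on
log_line_prefix = '%m [%p] db=%d,user=%u,app=%a,client=%h '
```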
Raw message example:
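A representative line of that shape (illustrative values, not the reporter's actual log):

```
2024-05-13 12:34:56.789 CEST [12345] db=mydb,user=postgres,app=[unknown],client=10.0.0.5 LOG:  connection received: host=10.0.0.5 port=51234
```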
The `input.type`, `log.flags` and `log.offset` fields are not defined in the `agent.yml` file, hence added them to the `agent.yml` file (see the sketch below).
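A sketch of the kind of definitions added (exact types and descriptions in the PR may differ):

```yaml
# data_stream/log/agent.yml (sketch)
- name: input.type
  type: keyword
  description: Type of input the event was collected with.
- name: log.flags
  type: keyword
  description: Flags for the log file.
- name: log.offset
  type: long
  description: Offset of the entry in the log file.
```

Related Issues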