supports a "segment_prefix" in the edi parser file declaration #154

samolds · 2021-11-18T19:31:19Z

I'm working with a non-standard EDI format that includes a segment prefix. For example, a message might be:

|HDR|1|2|3|
|DAT|X|
|EOF|

where every segment begins with a pipe. I thought I could get around this by making the segment delimiter include the next pipe (i.e. |\n|), but this doesn't catch the very first pipe.
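To illustrate why the delimiter-only workaround falls short, here is a small Go sketch. It is not omniparser code: `splitSegments` is a hypothetical helper, and it assumes the segments in the sample message are separated by newlines.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSegments splits raw on the segment delimiter only, the way a
// tokenizer with no notion of a segment prefix would. Hypothetical helper.
func splitSegments(raw, delim string) []string {
	return strings.Split(raw, delim)
}

func main() {
	raw := "|HDR|1|2|3|\n|DAT|X|\n|EOF|"
	// Use "|\n|" as the segment delimiter so it swallows the next pipe.
	for i, seg := range splitSegments(raw, "|\n|") {
		fmt.Printf("segment %d: %q\n", i, seg)
	}
}
```

The first segment still carries its leading pipe, because no delimiter precedes it, and the last segment keeps its trailing pipe.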

I propose including a new (optional) "segment_prefix" field in the file_declaration to catch segment prefixes.

jf-tech · 2021-11-19T05:06:23Z

First, thank you very much for using omniparser, analyzing your issue, and proposing a solution! Really appreciate it! In general, I'd like to have an issue opened for in-depth discussion before a PR is determined to be needed and created.

I'd like to understand a bit more about your specific problem. EDI uses a segment delimiter and an element delimiter to compartmentalize data fragments. In your example, however, I'm not seeing an element delimiter. Does your EDI have only a segment delimiter? That would be highly unusual (I'm not even sure we should call it EDI format any more). Or does your EDI really use \n (\r\n on Windows) as the segment delimiter and | (pipe) as the element delimiter? I'm not sure, but I feel it's the latter, given that each of your segments occupies a single line. If that's the case, the issue is that your segment name isn't the first element but rather the second (the first element is empty).
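A quick way to see the "segment name is the second element" reading is to split one sample line on the pipe. This is only an illustrative Go sketch (`splitElements` is a hypothetical helper, not part of omniparser):

```go
package main

import (
	"fmt"
	"strings"
)

// splitElements splits a single segment on the element delimiter.
// Hypothetical helper for illustration only.
func splitElements(segment, elemDelim string) []string {
	return strings.Split(segment, elemDelim)
}

func main() {
	// One line of the sample message, with "\n" already removed.
	for i, e := range splitElements("|HDR|1|2|3|", "|") {
		fmt.Printf("element %d: %q\n", i, e)
	}
}
```

Element 0 is empty and the segment name HDR lands at index 1, so every element index is shifted by one compared with a conventional segment.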

Given the guess/analysis above, I have three solutions in mind:

1. Introduce an optional setting file_declaration.segment_name_index, defaulting to 0; in your case you would specify 1, and all subsequent element index references would be bumped up by 1 as well.
2. You or I create a generic io.Reader implementation named "PrefixStrippingReader", which strips a given prefix at the beginning of each line. A suitable place for this reader could be here: https://github.com/jf-tech/go-corelib/blob/master/ios/readers.go
3. Slightly hacky: if you can somehow ensure your EDI file starts with \n (or \r\n on Windows), then we can directly use this io.Reader implementation: https://github.com/jf-tech/go-corelib/blob/master/ios/bytesReplacingReader.go. Basically, you create this reader to replace every \n| with \n, and then give this reader to omniparser.
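For a sense of what option 2 could look like, here is a minimal standard-library sketch of a prefix-stripping io.Reader. All names (`prefixStrippingReader`, `newPrefixStrippingReader`, `stripAll`) are hypothetical; this is not code from go-corelib or omniparser, just one possible shape under the assumption that lines end with '\n'.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// prefixStrippingReader is a hypothetical io.Reader wrapper that removes
// prefix from the start of every line of the underlying reader.
type prefixStrippingReader struct {
	src    *bufio.Reader
	prefix []byte
	buf    []byte // stripped bytes of the current line not yet returned
	err    error  // sticky error, reported once buf is drained
}

func newPrefixStrippingReader(r io.Reader, prefix string) *prefixStrippingReader {
	return &prefixStrippingReader{src: bufio.NewReader(r), prefix: []byte(prefix)}
}

func (p *prefixStrippingReader) Read(out []byte) (int, error) {
	for len(p.buf) == 0 {
		if p.err != nil {
			return 0, p.err
		}
		// ReadBytes returns the line including '\n'; the last line before
		// EOF is returned together with io.EOF and is stripped as well.
		line, err := p.src.ReadBytes('\n')
		p.buf = bytes.TrimPrefix(line, p.prefix)
		p.err = err
	}
	n := copy(out, p.buf)
	p.buf = p.buf[n:]
	return n, nil
}

// stripAll drains input through the reader and returns the stripped text.
func stripAll(input, prefix string) (string, error) {
	b, err := io.ReadAll(newPrefixStrippingReader(strings.NewReader(input), prefix))
	return string(b), err
}

func main() {
	out, err := stripAll("|HDR|1|2|3|\n|DAT|X|\n|EOF|", "|")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out)
}
```

The stripped stream could then be handed to the parser in place of the original reader, with | as the element delimiter and \n as the segment delimiter.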

Given this very non-standard EDI structure in your situation, I'm a bit inclined toward option 2, since the file_declaration.segment_name_index setting would be all but guaranteed to be used only by you.

What do you think, @samolds ?

samolds · 2021-11-19T23:13:01Z

Sorry for jumping the gun and opening a PR straight away! I had already gone ahead with the changes in order to meet a demo deadline and thought I would try and contribute back, even if this is some weird EDI flavor.

I have a specification that states:

Record Delimiter PIPE ‘|’ All records start with Pipe and end with Pipe, New Line ‘|\n’, or carriage return,
Field Delimiter PIPE ‘|’ All fields start with Pipe
All files end with '|EOF|' at end of file, after a carriage return

And then proceeds to define all of the segments and elements and loops and various rules for the expected data types, min, max, format checks, etc.

Following through your EDI In Depth wiki, I created an EDI parsing spec (simplified here for brevity) that successfully transforms the raw data:

{
  "parser_settings": {
    "version": "omni.2.1",
    "file_format_type": "edi"
  },
  "file_declaration": {
    "element_delimiter": "|",
    "segment_prefix": "|",
    "segment_delimiter": "\n",
    "ignore_crlf": false,
    "segment_declarations": [
      {
        "name": "document",
        "type": "segment_group",
        "min": 1,
        "max": -1,
        "is_target": true,
        "child_segments": [
          {
            "name": "DOC_TYPE",
            "elements": [
              { "name": "name", "index": 1 },
              { "name": "timestamp", "index": 3 },
              { "name": "version", "index": 8 }
            ]
          },
          {
            "name": "record",
            "type": "segment_group",
            "min": 1,
            "max": -1,
            "child_segments": [
              {
                "name": "REC",
                "elements": [
                  { "name": "record_id", "index": 2 }
                ]
              },
              {
                "name": "header",
                "type": "segment_group",
                "min": 1,
                "max": 1,
                "child_segments": [
                  {
                    "name": "HDR",
                    "elements": [
                      { "name": "part_number", "index": 1 }
                    ]
                  }
                ]
              }
            ]
          },
          { "name": "EOF" }
        ]
      }
    ]
  },
  "transform_declarations": {
    "FINAL_OUTPUT": {
      "object": {
        "name": { "xpath": "DOC_TYPE/name" }
      }
    }
  }
}

So yes, my spec uses \n as the segment delimiter, and | for the element delimiter. But then it has this pesky | at the beginning of the segments as well. I understand not wanting to include bizarre functionality in the main parser if it's unlikely to be used by anyone else haha. I'm fairly new to the world of EDI and wasn't sure if this was a potentially common kind of thing.

I'm cool with option 2. I will try to find some time to make those changes, but I might have to continue using my fork until I get around to it.

…anism.

LineReader implements io.Reader interface with a line editing mechanism. LineReader reads data from underlying io.Reader and invokes the caller supplied edit function for each of the line (defined as []byte ending with '\n', therefore it works on both Mac/Linux and Windows, where '\r\n' is used). Note the last line before EOF will be edited as well even if it doesn't end with '\n'.

Usage is highly flexible: the editing function can do in-place editing such as character replacement, prefix/suffix stripping, or word replacement, etc., as long as the line length isn't changed; or it can replace a line with a completely newly allocated and written line with no length restriction (although performance would be slower compared to in-place editing).

ios.LineReader is at least as performant as ios.BytesReplacingReader:

```
BenchmarkLineReader_RawIORead-8                          23300      51319 ns/op   1103392 B/op   23 allocs/op
BenchmarkLineReader_UseLineReader-8                       3343     351305 ns/op   1104512 B/op   25 allocs/op
BenchmarkLineReader_CompareWithBytesReplacingReader-8      978    1226656 ns/op   1107648 B/op   26 allocs/op
```

This PR is motivated from real usage case discussed in jf-tech/omniparser#154

jf-tech · 2021-11-27T04:15:36Z

@samolds if you have time, would you mind taking a look at jf-tech/go-corelib#22, where I introduce a LineReader with an editing mechanism, as we discussed earlier in this PR? Let me know if you think it would fit your needs.

samolds · 2021-11-30T22:30:06Z

jf-tech/go-corelib#22 is a good solution. I am going to close this PR in favor of using the LineEditingReader introduced in the other PR. Thanks!

…ng mechanism (#22)

`LineEditingReader` implements `io.Reader` interface with a line editing mechanism. `LineEditingReader` reads data from underlying `io.Reader` and invokes the caller supplied edit function for each of the line (defined as `[]byte` ending with `'\n'`, therefore it works on both Mac/Linux and Windows, where `'\r\n'` is used). Note the last line before `EOF` will be edited as well even if it doesn't end with `'\n'`.

Usage is highly flexible: the editing function can do in-place editing such as character replacement, prefix/suffix stripping, or word replacement, etc., as long as the line length isn't increased; or it can replace a line with a completely newly allocated and written line with no length restriction (although performance might be slower compared to in-place editing).

`ios.LineEditingReader` is at least as performant as `ios.BytesReplacingReader`:

```
BenchmarkLineEditingReader_RawIORead-8                          23300      51319 ns/op   1103392 B/op   23 allocs/op
BenchmarkLineEditingReader_UseLineEditingReader-8                3343     351305 ns/op   1104512 B/op   25 allocs/op
BenchmarkLineEditingReader_CompareWithBytesReplacingReader-8      978    1226656 ns/op   1107648 B/op   26 allocs/op
```

This PR is motivated from real usage case discussed in jf-tech/omniparser#154
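The line-editing idea described in that commit message can be sketched with only the Go standard library. The code below is an illustration of the general technique, not the go-corelib implementation and not its API: `editFunc`, `lineEditor`, `newLineEditor`, `stripLeadingPipe`, and `editAll` are all hypothetical names, and no claim is made about matching the library's performance.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// editFunc is a hypothetical callback type: it receives one line (including
// its trailing '\n', if any) and returns the bytes to emit in its place.
type editFunc func(line []byte) ([]byte, error)

// lineEditor wraps an io.Reader and passes every line through edit.
type lineEditor struct {
	src  *bufio.Reader
	edit editFunc
	buf  []byte // edited bytes not yet handed to the caller
	err  error  // sticky error, returned once buf is drained
}

func newLineEditor(r io.Reader, edit editFunc) *lineEditor {
	return &lineEditor{src: bufio.NewReader(r), edit: edit}
}

func (l *lineEditor) Read(out []byte) (int, error) {
	for len(l.buf) == 0 {
		if l.err != nil {
			return 0, l.err
		}
		// The final line is returned together with io.EOF even when it
		// lacks '\n', so it is edited like any other line.
		line, err := l.src.ReadBytes('\n')
		l.err = err
		if len(line) == 0 {
			continue // nothing to edit; the stored error is reported next
		}
		edited, editErr := l.edit(line)
		if editErr != nil {
			l.err = editErr
			return 0, editErr
		}
		l.buf = edited
	}
	n := copy(out, l.buf)
	l.buf = l.buf[n:]
	return n, nil
}

// stripLeadingPipe shortens the line by reslicing past a leading '|'.
func stripLeadingPipe(line []byte) ([]byte, error) {
	return bytes.TrimPrefix(line, []byte("|")), nil
}

// editAll drains input through a lineEditor and returns the edited text.
func editAll(input string, edit editFunc) (string, error) {
	b, err := io.ReadAll(newLineEditor(strings.NewReader(input), edit))
	return string(b), err
}

func main() {
	out, err := editAll("|HDR|1|2|3|\r\n|DAT|X|\r\n|EOF|", stripLeadingPipe)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out)
}
```

Because the callback sees each line with its terminator intact, the same sketch handles both '\n' and '\r\n' input, and an edit function is free to return a freshly allocated, longer line instead of a reslice.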

jf-tech · 2021-12-01T00:01:23Z

FYI @samolds, https://github.com/jf-tech/go-corelib v0.0.16, which contains the LineEditingReader, has been released. Let me know if you encounter any issues/bugs.

supports a "segment_prefix" in the edi parser file declaration

15c27a3

samolds force-pushed the master branch from 393169e to 15c27a3 Compare November 18, 2021 19:43

jf-tech mentioned this pull request Nov 27, 2021

Introduce ios.LineEditingReader, an io.Reader wrapper with line editing mechanism jf-tech/go-corelib#22

Merged

samolds closed this Nov 30, 2021
