@samolds samolds commented Nov 18, 2021

I'm working with a non-standard EDI format that includes a segment prefix. For example, a message might be:

|HDR|1|2|3|
|DAT|X|
|EOF|

where every segment begins with a pipe. I thought I could get around this by making the segment delimiter include the next pipe (i.e. |\n|), but this doesn't catch the very first pipe.

I propose including a new (optional) "segment_prefix" field in the file_declaration to catch segment prefixes.

Owner

jf-tech commented Nov 19, 2021

First, thank you very much for using omniparser, analyzing your issue, and proposing a solution! Really appreciate it! In general, I'd like to have an issue opened for in-depth discussion before a PR is determined to be needed and created.

I'd like to understand a bit more about your specific problem. EDI uses a segment delimiter and an element delimiter to compartmentalize data fragments. In your example, however, I'm not seeing an element delimiter. Does your EDI have only a segment delimiter? That would be highly unusual (I'm not even sure we should call it EDI any more). Or does your EDI really use \n (\r\n on Windows) as the segment delimiter and the | pipe as the element delimiter? I'm not sure, but I suspect the latter, given that each of your segments occupies a single line. If that's the case, the issue is that your segment name isn't the first element but the second (the first element is empty).
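To illustrate the "first element is empty" point, here is a minimal sketch using plain string splitting from the Go standard library. It is not omniparser's actual tokenizer, and `splitElements` is a hypothetical helper named only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitElements is a hypothetical helper: it splits one segment on the
// element delimiter using plain string splitting (not omniparser's parser).
func splitElements(segment, elementDelim string) []string {
	return strings.Split(segment, elementDelim)
}

func main() {
	// With '|' as the element delimiter, the leading pipe produces an empty
	// element at index 0, so the segment name "HDR" lands at index 1.
	elems := splitElements("|HDR|1|2|3|", "|")
	fmt.Printf("%q\n", elems) // prints ["" "HDR" "1" "2" "3" ""]
}
```

Under this reading, the leading pipe is simply an element delimiter in front of an empty element, which is why either shifting the segment name index or stripping the prefix before parsing would address it.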

Given the guess/analysis above, I have three solutions in mind:

  1. Introduce an optional setting file_declaration.segment_name_index, defaulting to 0; in your case you would specify 1, and all subsequent element index references would be bumped up by 1 as well.

  2. You or I create a generic io.Reader implementation named "PrefixStrippingReader" which strips a given prefix at the beginning of each line. A suitable place for this reader could be here: https://github.com/jf-tech/go-corelib/blob/master/ios/readers.go

  3. Slightly hacky: if you can somehow ensure your EDI file starts with a \n (or \r\n on Windows), then we can directly use this io.Reader implementation: https://github.com/jf-tech/go-corelib/blob/master/ios/bytesReplacingReader.go. Basically, you create this reader to replace every \n| with \n, and then give this reader to omniparser.

Given this very very non-standard EDI structure in your situation, I'm a bit inclined toward option 2 since that file_declaration.segment_name_index setting would be all but guaranteed to be only used by you.

What do you think, @samolds ?

Author

samolds commented Nov 19, 2021

Sorry for jumping the gun and opening a PR straight away! I had already gone ahead with the changes in order to meet a demo deadline and thought I would try and contribute back, even if this is some weird EDI flavor.

I have a specification that states:

Record Delimiter PIPE ‘|’ All records start with Pipe and end with Pipe, New Line ‘|\n’, or carriage return,
Field Delimiter PIPE ‘|’ All fields start with Pipe
All files end with '|EOF|' at end of file, after a carriage return

And then proceeds to define all of the segments and elements and loops and various rules for the expected data types, min, max, format checks, etc.

Following through your EDI In Depth wiki, I created an EDI parsing spec (simplified here for brevity) that successfully transforms the raw data:

{
  "parser_settings": {
    "version": "omni.2.1",
    "file_format_type": "edi"
  },
  "file_declaration": {
    "element_delimiter": "|",
    "segment_prefix": "|",
    "segment_delimiter": "\n",
    "ignore_crlf": false,
    "segment_declarations": [
      {
        "name": "document",
        "type": "segment_group",
        "min": 1,
        "max": -1,
        "is_target": true,
        "child_segments": [
          {
            "name": "DOC_TYPE",
            "elements": [
              { "name": "name", "index": 1 },
              { "name": "timestamp", "index": 3 },
              { "name": "version", "index": 8 }
            ]
          },
          {
            "name": "record",
            "type": "segment_group",
            "min": 1,
            "max": -1,
            "child_segments": [
              {
                "name": "REC",
                "elements": [
                  { "name": "record_id", "index": 2 }
                ]
              },
              {
                "name": "header",
                "type": "segment_group",
                "min": 1,
                "max": 1,
                "child_segments": [
                  {
                    "name": "HDR",
                    "elements": [
                      { "name": "part_number", "index": 1 }
                    ]
                  }
                ]
              }
            ]
          },
          { "name": "EOF" }
        ]
      }
    ]
  },
  "transform_declarations": {
    "FINAL_OUTPUT": {
      "object": {
        "name": { "xpath": "DOC_TYPE/name" }
      }
    }
  }
}

So yes, my spec uses \n as the segment delimiter, and | for the element delimiter. But then it has this pesky | at the beginning of the segments as well. I understand not wanting to include bizarre functionality in the main parser if it's unlikely to be used by anyone else haha. I'm fairly new to the world of EDI and wasn't sure if this was a potentially common kind of thing.

I'm cool with option 2. I will try to find some time to make those changes, but I might have to continue using my fork until I get around to it.

jf-tech added a commit to jf-tech/go-corelib that referenced this pull request Nov 27, 2021
…anism.

LineReader implements io.Reader interface with a line editing mechanism. LineReader reads data from underlying io.Reader and invokes the caller supplied edit function for each of the line (defined as []byte ending with '\n', therefore it works on both Mac/Linux and Windows, where '\r\n' is used). Note the last line before EOF will be edited as well even if it doesn't end with '\n'.

Usage is highly flexible: the editing function can do in-place editing such as character replacement, prefix/suffix stripping, or word replacement, etc., as long as the line length isn't changed; or it can replace a line with a completely newly allocated and written line with no length restriction (although performance would be slower compared to in-place editing).

ios.LineReader is at least as performant as ios.BytesReplacingReader:

```
BenchmarkLineReader_RawIORead-8                           23300     51319 ns/op   1103392 B/op   23 allocs/op
BenchmarkLineReader_UseLineReader-8                        3343    351305 ns/op   1104512 B/op   25 allocs/op
BenchmarkLineReader_CompareWithBytesReplacingReader-8       978   1226656 ns/op   1107648 B/op   26 allocs/op
```

This PR is motivated from real usage case discussed in jf-tech/omniparser#154

Owner

jf-tech commented Nov 27, 2021

@samolds if you have time, would you mind taking a look at jf-tech/go-corelib#22, where I introduce a LineReader with an editing mechanism, as we discussed earlier in this PR? Let me know if you think it would fit your needs.

Author

samolds commented Nov 30, 2021

jf-tech/go-corelib#22 is a good solution. I am going to close this PR in favor of using the LineEditingReader introduced in the other PR. Thanks!

@samolds samolds closed this Nov 30, 2021
jf-tech added a commit to jf-tech/go-corelib that referenced this pull request Nov 30, 2021
…ng mechanism (#22)

`LineEditingReader` implements `io.Reader` interface with a line editing mechanism. `LineEditingReader` reads data from underlying `io.Reader` and invokes the caller supplied edit function for each of the line (defined as `[]byte` ending with `'\n'`, therefore it works on both Mac/Linux and Windows, where `'\r\n'` is used). Note the last line before `EOF` will be edited as well even if it doesn't end with `'\n'`.

Usage is highly flexible: the editing function can do in-place editing such as character replacement, prefix/suffix stripping, or word replacement, etc., as long as the line length isn't increased; or it can replace a line with a completely newly allocated and written line with no length restriction (although performance might be slower compared to in-place editing).

`ios.LineEditingReader` is at least as performant as `ios.BytesReplacingReader`:

```
BenchmarkLineEditingReader_RawIORead-8                           23300     51319 ns/op   1103392 B/op   23 allocs/op
BenchmarkLineEditingReader_UseLineEditingReader-8                 3343    351305 ns/op   1104512 B/op   25 allocs/op
BenchmarkLineEditingReader_CompareWithBytesReplacingReader-8       978   1226656 ns/op   1107648 B/op   26 allocs/op
```

This PR is motivated from real usage case discussed in jf-tech/omniparser#154

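The line-editing idea described in this commit message can be sketched with only the Go standard library. The following is an illustration of the technique, not the go-corelib implementation: `editFunc`, `lineEditReader`, `newLineEditReader`, and `stripLeadingPipes` are hypothetical names invented for this example, and the real `ios.LineEditingReader` API may differ.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// editFunc is a hypothetical callback type: it receives one line (including
// its trailing '\n', if present) and returns the edited line.
type editFunc func(line []byte) []byte

// lineEditReader is an illustrative io.Reader that applies edit to each line
// read from the underlying reader. It is NOT the go-corelib implementation.
type lineEditReader struct {
	src     *bufio.Reader
	edit    editFunc
	pending []byte // edited bytes not yet handed to the caller
	err     error  // sticky error (usually io.EOF) from the underlying reader
}

func newLineEditReader(r io.Reader, edit editFunc) *lineEditReader {
	return &lineEditReader{src: bufio.NewReader(r), edit: edit}
}

func (r *lineEditReader) Read(p []byte) (int, error) {
	for len(r.pending) == 0 {
		if r.err != nil {
			return 0, r.err
		}
		// ReadBytes returns whatever was read even when it hits EOF before
		// finding '\n', so a final unterminated line is edited as well.
		line, err := r.src.ReadBytes('\n')
		r.err = err
		if len(line) > 0 {
			r.pending = r.edit(line)
		}
	}
	n := copy(p, r.pending)
	r.pending = r.pending[n:]
	return n, nil
}

// stripLeadingPipes runs the sketch end to end: it removes one leading '|'
// from every line. TrimPrefix only reslices, so the line never grows.
func stripLeadingPipes(input string) string {
	r := newLineEditReader(strings.NewReader(input), func(line []byte) []byte {
		return bytes.TrimPrefix(line, []byte("|"))
	})
	out, err := io.ReadAll(r)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	fmt.Print(stripLeadingPipes("|HDR|1|2|3|\n|DAT|X|\n|EOF|"))
}
```

In a real pipeline, a reader like this would wrap the raw input before it is handed to the parser, so the schema could drop the segment prefix and treat `\n` and `|` as ordinary segment and element delimiters.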
Owner

jf-tech commented Dec 1, 2021

FYI @samolds, https://github.com/jf-tech/go-corelib v0.0.16, which contains the LineEditingReader, has been released. Let me know if you encounter any issues/bugs.
