Lex and parse in terms of bytes rather than characters #57

edsrzf · 2022-09-13T08:28:44Z

Fixes #26.

I've tried to make the conversion as straightforward as possible, so it may be non-optimal in some respects, but I figure we can do further refactoring later. All tests pass.

One thing that's worse now is the output of failing tests. The Debug impl for Vec<u8> prints things like [50, 51, 52], so it's not very human-readable.

We could look at pulling in a dependency like bstr, or otherwise adding a custom Debug impl. Let me know what you think and I can either add it onto this PR or do it as a separate one.
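To illustrate the readability problem, here is a minimal sketch of what the derived Debug output looks like for a byte vector, next to one possible readable form. The helper names `derived_debug` and `readable` are hypothetical and are not part of the PR.

```rust
/// Formats bytes the way the standard `Debug` impl for `Vec<u8>` does:
/// a list of decimal numbers.
fn derived_debug(bytes: &[u8]) -> String {
    format!("{:?}", bytes.to_vec())
}

/// Hypothetical readable alternative: decode lossily, for display only.
/// Invalid UTF-8 sequences are replaced, so this is not a faithful round trip.
fn readable(bytes: &[u8]) -> String {
    format!("{:?}", String::from_utf8_lossy(bytes))
}
```

For the bytes of `234`, `derived_debug` yields `[50, 51, 52]`, while `readable` yields `"234"`.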

ryangjchandler · 2022-09-13T08:34:08Z

Wow, appreciate the work on this @edsrzf.

I definitely think a custom Debug impl would be useful, for convenience.

edsrzf · 2022-09-13T08:31:37Z

trunk_lexer/src/lexer.rs

 }

- pub fn tokenize(&mut self, input: &str) -> Result<Vec<Token>, LexerError> {
+ pub fn tokenize<B: ?Sized + AsRef<[u8]>>(


Making this generic means it can now accept strings or byte slices.
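As a rough sketch of why this bound works: `str`, `[u8]`, `String`, `Vec<u8>` and byte arrays all implement `AsRef<[u8]>`, and `?Sized` is what lets `B` be the unsized `str` or `[u8]` behind a reference. The function `byte_len` below is a hypothetical stand-in, and its `input: &B` parameter is an assumption, since the real `tokenize` parameter list is cut off in the diff above.

```rust
/// Hypothetical stand-in for a byte-oriented entry point.
/// It only reports how many bytes it was given.
fn byte_len<B: ?Sized + AsRef<[u8]>>(input: &B) -> usize {
    // Whatever the caller passed, we see it as a plain byte slice here.
    let bytes: &[u8] = input.as_ref();
    bytes.len()
}

/// Exercises the helper with several input types.
fn demo_lengths() -> Vec<usize> {
    let owned_string = String::from("<?php");
    let owned_bytes: Vec<u8> = vec![b'$', 0xff]; // not valid UTF-8
    vec![
        byte_len("<?php"),       // B = str
        byte_len(b"<?php"),      // B = [u8; 5]
        byte_len(&owned_string), // B = String
        byte_len(&owned_bytes),  // B = Vec<u8>
    ]
}
```

Callers that already hold a `&str` keep working unchanged, while input that is not valid UTF-8 can be passed as bytes.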

edsrzf · 2022-09-13T08:36:23Z

trunk_lexer/src/lexer.rs

 };
 }

+ fn var<B: ?Sized + AsRef<[u8]>>(v: &B) -> TokenKind {


I found it was actually easier to write this as a generic function than a macro, when it needs to deal with both string and byte literals. Alternatively we could always use byte literals, I guess.

This is fine. The macros were originally just a quick and dirty way of removing repetition; most of them could probably be removed so that the tests are explicit anyway.

edsrzf · 2022-09-13T08:59:40Z

Thoughts on the bstr idea? If I start thinking about how I'd approach a custom Debug impl, I'd probably create a new byte string type with the impl and then have variants like TokenKind::Comment(ByteString) and continue deriving the impl for TokenKind. And then I've basically reimplemented a subset of bstr.

Is there a reason to prefer doing that instead of pulling in bstr?

ryangjchandler · 2022-09-13T09:10:27Z

@edsrzf I'd ideally like to avoid dependencies where possible. Given that this is really just a case of wrapping a byte array / vec, I think we can just hand-roll it with a ByteString structure.
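A hand-rolled wrapper along these lines might look like the sketch below. It is an illustration only: the field layout, the escaping rules, and the single `Comment` variant are assumptions, and the `ByteString` actually added in this PR may differ.

```rust
use std::fmt;

/// Hypothetical newtype over raw bytes with a readable `Debug` impl.
#[derive(PartialEq, Eq, Clone)]
struct ByteString(Vec<u8>);

impl fmt::Debug for ByteString {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "\"")?;
        for &b in &self.0 {
            match b {
                // Escape the quote and backslash so the output stays unambiguous.
                b'"' | b'\\' => write!(f, "\\{}", b as char)?,
                // Printable ASCII is shown as-is.
                0x20..=0x7e => write!(f, "{}", b as char)?,
                // Everything else (control bytes, non-ASCII) becomes \xNN.
                _ => write!(f, "\\x{:02x}", b)?,
            }
        }
        write!(f, "\"")
    }
}

/// Deriving `Debug` on the enum now picks up the custom impl for its payload.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Comment(ByteString),
}
```

With this, a failing assertion on `TokenKind::Comment(ByteString(b"// hi".to_vec()))` would print `Comment("// hi")` rather than a list of numbers, and bytes that are not valid UTF-8 still show up as `\xNN` escapes.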

trunk_lexer/Cargo.toml

ryangjchandler · 2022-09-13T10:33:46Z

Great work, thanks @edsrzf!

Evan Shaw added 3 commits September 13, 2022 17:01

Lex and parse in terms of bytes rather than characters

4c7047f

Allow lexer to accept bytes as input

f345389

Add a test case for non-UTF-8 variable name

f0fc4f1

Evan Shaw added 2 commits September 13, 2022 22:03

Add ByteString type

843a00c

Use ByteString through lexer and parser

447b1be

Remove serde dependency

d803e53

ryangjchandler approved these changes Sep 13, 2022

ryangjchandler merged commit c023468 into php-rust-tools:main Sep 13, 2022

edsrzf deleted the byte-lexer branch September 13, 2022 10:34
