Code quality, design, and organizational issues

Hi, and thanks for making this!

I noticed that the repo could use a number of stylistic, correctness, and performance improvements. For example:

The code often performs racy filesystem checks of the form if path.exists() { do_thing(); }. This is incorrect, as the filesystem is a shared resource, and such code is therefore vulnerable to TOCTOU bugs/attacks.
The code is in general unnecessarily complex at certain points, and contains some pieces of bad practice as well. For example:
- The explicit lifetime in StaticModel::truncate_str() could be elided.
- Again in truncate_str(), the explicit checking of s.chars().count() <= max_c is redundant and only causes an unnecessary iteration over the entire string. The function does exactly the same thing without it.
- The explicit indexing into the 4-byte or 2-byte chunks during the conversion of the tensor contents to type f32 would be better expressed as <&[u8]>::try_into().unwrap(), which directly produces a [u8; N].
- Due to potential loss of precision, from-conversion should be preferred over as-conversion for numeric types; the i8-typed path should teherefore be f32::from(b as i8) instead of the current b as i8 as f32.
- There is again unnecessary casting and indexing when retrieving the shape of the tensor. It should simply be:
```
let &[rows, cols] = tensor.shape().try_into().context("embedding tensor is not a 2D matrix")?;
```
- Computing the median token length could be done without so many references and explicit dereferencing, and be simplified to just
```
let median_token_length = lens.get(lens.len() / 2).copied().unwrap_or(1);
```
- Using std::env::set_var() is already discouraged in itself (as it's not safe, see the linked docs), but using what is effectively a global variable for communicating between parts within the same program is especially inelegant.
- Given that the whole library is about text processing, the humble token argument to the StaticModel::from_pretrained() method has a very non-informative and confusing name. Which token is it? The token to be classified? The start token? Ah, it's the HuggingFace API token! It would be better to call it hf_api_token instead.
- It would be better to accept an AsRef<Path> as the type of the repo_or_path argument instead of forcing a concrete &str. The reason for this is that &str implements AsRef<Path>, but a &Path can't easily (and losslessly) be converted to a &str because they (paths) aren't necessarily valid UTF-8. So the API would be more flexible and easier to use if it accepted an AsRef<Path>, which includes strings.
- Again, deciding whether the input is a filesystem path or a HF repo URL based on the existence of the path on a local drive is at best shady. If someone misspells the path, or tries to access a directory with insufficient permissions etc., s/he will be confused as to why the library suddenly tries to download from HF. There should be an explicit flag/enum to declare which behavior one wants, or maybe even two entirely separate constructor functions for the two kinds of data sources.
- Single-variable expressions in format strings can and should be inlined for better readability. For example:
```
- .map_err(|e| anyhow!("Failed to load tokenizer: {}", e))?; + .map_err(|e| anyhow!("failed to load tokenizer: {e}"))?;
```
and the same applies to the error message containing unk_token, for example.
- (ah, yes, and error messages in Rust generally start with a lowercase letter by convention.)
- Since serde_json can directly work with Readers, there's no need to read entire JSON files in memory before parsing them:
```
- let cfg_bytes = fs::read(&cfg_path).context("Failed to read config.json")?; - let cfg: Value = serde_json::from_slice(&cfg_bytes).context("Failed to parse config.json")?; + let cfg_file = File::open(&cfg_path).context("failed to read config.json")?; + let cfg: Value = serde_json::from_reader(&cfg_file).context("failed to parse config.json")?;
```
The return type of the embedding methods is Vec<Vec<f32>>, which in general means many allocations and a cache-unfriendly, discontiguous structure. Since the embeddings always have the same dimension/size, Array2<f32> would be a better return type, so a contiguous chunk of memory is returned and only a single allocation is needed.
The executable command should be in a different crate so the library doesn't need to depend on clap, which has a pretty large transitive dependency tree that is entirely unnecessary when only the library is needed.
The HF downloading should be behind a separate (but perhaps defaulted) feature, which allows one to skip a large number of further (networking/TLS) dependencies unless needed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Code quality, design, and organizational issues #14

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Code quality, design, and organizational issues #14

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions