Skip to content

Code quality, design, and organizational issues #14

@H2CO3

Description

@H2CO3

Hi, and thanks for making this!

I noticed that the repo could use a number of stylistic, correctness, and performance improvements. For example:

  • The code often performs racy filesystem checks of the form if path.exists() { do_thing(); }. This is incorrect, as the filesystem is a shared resource, and such code is therefore vulnerable to TOCTOU bugs/attacks.
  • The code is in general unnecessarily complex at certain points, and contains some pieces of bad practice as well. For example:
    • The explicit lifetime in StaticModel::truncate_str() could be elided.
    • Again in truncate_str(), the explicit checking of s.chars().count() <= max_c is redundant and only causes an unnecessary iteration over the entire string. The function does exactly the same thing without it.
    • The explicit indexing into the 4-byte or 2-byte chunks during the conversion of the tensor contents to type f32 would be better expressed as <&[u8]>::try_into().unwrap(), which directly produces a [u8; N].
    • Due to potential loss of precision, from-conversion should be preferred over as-conversion for numeric types; the i8-typed path should teherefore be f32::from(b as i8) instead of the current b as i8 as f32.
    • There is again unnecessary casting and indexing when retrieving the shape of the tensor. It should simply be:
      let &[rows, cols] = tensor.shape().try_into().context("embedding tensor is not a 2D matrix")?;
    • Computing the median token length could be done without so many references and explicit dereferencing, and be simplified to just
      let median_token_length = lens.get(lens.len() / 2).copied().unwrap_or(1);
    • Using std::env::set_var() is already discouraged in itself (as it's not safe, see the linked docs), but using what is effectively a global variable for communicating between parts within the same program is especially inelegant.
    • Given that the whole library is about text processing, the humble token argument to the StaticModel::from_pretrained() method has a very non-informative and confusing name. Which token is it? The token to be classified? The start token? Ah, it's the HuggingFace API token! It would be better to call it hf_api_token instead.
    • It would be better to accept an AsRef<Path> as the type of the repo_or_path argument instead of forcing a concrete &str. The reason for this is that &str implements AsRef<Path>, but a &Path can't easily (and losslessly) be converted to a &str because they (paths) aren't necessarily valid UTF-8. So the API would be more flexible and easier to use if it accepted an AsRef<Path>, which includes strings.
    • Again, deciding whether the input is a filesystem path or a HF repo URL based on the existence of the path on a local drive is at best shady. If someone misspells the path, or tries to access a directory with insufficient permissions etc., s/he will be confused as to why the library suddenly tries to download from HF. There should be an explicit flag/enum to declare which behavior one wants, or maybe even two entirely separate constructor functions for the two kinds of data sources.
    • Single-variable expressions in format strings can and should be inlined for better readability. For example:
      - .map_err(|e| anyhow!("Failed to load tokenizer: {}", e))?; + .map_err(|e| anyhow!("failed to load tokenizer: {e}"))?;
    and the same applies to the error message containing unk_token, for example.
    • (ah, yes, and error messages in Rust generally start with a lowercase letter by convention.)
    • Since serde_json can directly work with Readers, there's no need to read entire JSON files in memory before parsing them:
      - let cfg_bytes = fs::read(&cfg_path).context("Failed to read config.json")?; - let cfg: Value = serde_json::from_slice(&cfg_bytes).context("Failed to parse config.json")?; + let cfg_file = File::open(&cfg_path).context("failed to read config.json")?; + let cfg: Value = serde_json::from_reader(&cfg_file).context("failed to parse config.json")?;
  • The return type of the embedding methods is Vec<Vec<f32>>, which in general means many allocations and a cache-unfriendly, discontiguous structure. Since the embeddings always have the same dimension/size, Array2<f32> would be a better return type, so a contiguous chunk of memory is returned and only a single allocation is needed.
  • The executable command should be in a different crate so the library doesn't need to depend on clap, which has a pretty large transitive dependency tree that is entirely unnecessary when only the library is needed.
  • The HF downloading should be behind a separate (but perhaps defaulted) feature, which allows one to skip a large number of further (networking/TLS) dependencies unless needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions