- Notifications
You must be signed in to change notification settings - Fork 12
Closed
Description
Hi, and thanks for making this!
I noticed that the repo could use a number of stylistic, correctness, and performance improvements. For example:
- The code often performs racy filesystem checks of the form
if path.exists() { do_thing(); }. This is incorrect, as the filesystem is a shared resource, and such code is therefore vulnerable to TOCTOU bugs/attacks. - The code is in general unnecessarily complex at certain points, and contains some pieces of bad practice as well. For example:
- The explicit lifetime in
StaticModel::truncate_str()could be elided. - Again in
truncate_str(), the explicit checking ofs.chars().count() <= max_cis redundant and only causes an unnecessary iteration over the entire string. The function does exactly the same thing without it. - The explicit indexing into the 4-byte or 2-byte chunks during the conversion of the tensor contents to type
f32would be better expressed as<&[u8]>::try_into().unwrap(), which directly produces a[u8; N]. - Due to potential loss of precision,
from-conversion should be preferred overas-conversion for numeric types; thei8-typed path should teherefore bef32::from(b as i8)instead of the currentb as i8 as f32. - There is again unnecessary casting and indexing when retrieving the shape of the tensor. It should simply be:
let &[rows, cols] = tensor.shape().try_into().context("embedding tensor is not a 2D matrix")?;
- Computing the median token length could be done without so many references and explicit dereferencing, and be simplified to just
let median_token_length = lens.get(lens.len() / 2).copied().unwrap_or(1);
- Using
std::env::set_var()is already discouraged in itself (as it's not safe, see the linked docs), but using what is effectively a global variable for communicating between parts within the same program is especially inelegant. - Given that the whole library is about text processing, the humble
tokenargument to theStaticModel::from_pretrained()method has a very non-informative and confusing name. Which token is it? The token to be classified? The start token? Ah, it's the HuggingFace API token! It would be better to call ithf_api_tokeninstead. - It would be better to accept an
AsRef<Path>as the type of therepo_or_pathargument instead of forcing a concrete&str. The reason for this is that&strimplementsAsRef<Path>, but a&Pathcan't easily (and losslessly) be converted to a&strbecause they (paths) aren't necessarily valid UTF-8. So the API would be more flexible and easier to use if it accepted anAsRef<Path>, which includes strings. - Again, deciding whether the input is a filesystem path or a HF repo URL based on the existence of the path on a local drive is at best shady. If someone misspells the path, or tries to access a directory with insufficient permissions etc., s/he will be confused as to why the library suddenly tries to download from HF. There should be an explicit flag/
enumto declare which behavior one wants, or maybe even two entirely separate constructor functions for the two kinds of data sources. - Single-variable expressions in format strings can and should be inlined for better readability. For example:
- .map_err(|e| anyhow!("Failed to load tokenizer: {}", e))?; + .map_err(|e| anyhow!("failed to load tokenizer: {e}"))?;
unk_token, for example.- (ah, yes, and error messages in Rust generally start with a lowercase letter by convention.)
- Since
serde_jsoncan directly work withReaders, there's no need to read entire JSON files in memory before parsing them:- let cfg_bytes = fs::read(&cfg_path).context("Failed to read config.json")?; - let cfg: Value = serde_json::from_slice(&cfg_bytes).context("Failed to parse config.json")?; + let cfg_file = File::open(&cfg_path).context("failed to read config.json")?; + let cfg: Value = serde_json::from_reader(&cfg_file).context("failed to parse config.json")?;
- The explicit lifetime in
- The return type of the embedding methods is
Vec<Vec<f32>>, which in general means many allocations and a cache-unfriendly, discontiguous structure. Since the embeddings always have the same dimension/size,Array2<f32>would be a better return type, so a contiguous chunk of memory is returned and only a single allocation is needed. - The executable command should be in a different crate so the library doesn't need to depend on
clap, which has a pretty large transitive dependency tree that is entirely unnecessary when only the library is needed. - The HF downloading should be behind a separate (but perhaps defaulted) feature, which allows one to skip a large number of further (networking/TLS) dependencies unless needed.
do-mecgisky1980
Metadata
Metadata
Assignees
Labels
No labels