Quickstart Guide
Get started with Deep Lake by following these examples.
Installation
Install Deep Lake with pip:

    pip install deeplake
Creating a Dataset

    import deeplake

    # Create a local dataset
    ds = deeplake.create("path/to/dataset")

    # Or create in cloud storage
    ds = deeplake.create("s3://my-bucket/dataset")
    ds = deeplake.create("gcs://my-bucket/dataset")
    ds = deeplake.create("azure://container/dataset")

Adding Data
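The append calls in this section take data in two shapes: a list of per-row dictionaries, or a single dictionary that maps each column name to a list of values. The sketch below converts the first shape into the second in plain Python. `rows_to_columns` is a hypothetical helper written for illustration and is not part of the Deep Lake API.

```python
def rows_to_columns(rows):
    """Convert a list of per-row dicts into one dict of column lists.

    Hypothetical helper for illustration; not part of the Deep Lake API.
    """
    if not rows:
        return {}
    columns = {name: [] for name in rows[0]}
    for row in rows:
        # Every row must provide exactly the same columns as the first row.
        if row.keys() != columns.keys():
            raise ValueError(f"row has mismatched columns: {sorted(row)}")
        for name, value in row.items():
            columns[name].append(value)
    return columns


rows = [
    {"ids": 1, "labels": "cat"},
    {"ids": 2, "labels": "dog"},
    {"ids": 3, "labels": "bird"},
]
batch = rows_to_columns(rows)
print(batch)  # {'ids': [1, 2, 3], 'labels': ['cat', 'dog', 'bird']}
```

Checking that every row carries the same keys before building the batch catches a missing column early, rather than leaving column lists of unequal length.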
Add columns to store different types of data:

    # Add basic data types
    ds.add_column("ids", "int32")
    ds.add_column("labels", "text")

    # Add specialized data types
    ds.add_column("images", deeplake.types.Image())
    ds.add_column("videos", deeplake.types.Video())
    ds.add_column("embeddings", deeplake.types.Embedding(768))
    ds.add_column("masks", deeplake.types.BinaryMask())

Insert data into the dataset:

    # Add single samples
    ds.append([{
        "ids": 1,
        "labels": "cat",
        "images": image_array,
        "videos": video_bytes,
        "embeddings": embedding_vector,
        "masks": mask_array
    }])

    # Add batches of data
    ds.append({
        "ids": [1, 2, 3],
        "labels": ["cat", "dog", "bird"],
        "images": batch_of_images,
        "videos": batch_of_videos,
        "embeddings": batch_of_embeddings,
        "masks": batch_of_masks
    })

    ds.commit()  # Commit changes to the storage

Accessing Data
Access individual samples:

    # Get single items
    image = ds["images"][0]
    label = ds["labels"][0]
    embedding = ds["embeddings"][0]

    # Get ranges
    images = ds["images"][0:100]
    labels = ds["labels"][0:100]

    # Get specific indices
    selected_images = ds["images"][[0, 2, 3]]

Vector Search
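The query in this section orders rows by cosine similarity between each stored embedding and a search vector. For intuition, here is a minimal brute-force sketch of that ranking in plain Python. It only illustrates the metric and the "sort descending, keep the top k" step; it is not how Deep Lake executes queries, and `cosine_similarity` and `top_k` are names chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k):
    """Return indices of the k embeddings most similar to query, best first."""
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


embeddings = [
    [1.0, 0.0],   # index 0: same direction as the query
    [0.0, 1.0],   # index 1: orthogonal to the query
    [1.0, 1.0],   # index 2: 45 degrees away from the query
    [-1.0, 0.0],  # index 3: opposite direction
]
print(top_k([2.0, 0.0], embeddings, k=2))  # [0, 2]
```

Cosine similarity depends only on direction, so the query `[2.0, 0.0]` scores the same against `[1.0, 0.0]` as a unit-length query would.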
Search by embedding similarity:

    # Find similar items
    text_vector = ','.join(str(x) for x in search_vector)
    results = ds.query(f"""
        SELECT *
        ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{text_vector}]) DESC
        LIMIT 100
    """)

    # Process results - Method 1: iterate through items
    for item in results:
        image = item["images"]
        label = item["labels"]

    # Process results - Method 2: direct column access
    # Recommended for better performance
    images = results["images"][:]
    labels = results["labels"][:]

Data Versioning

    # Commit changes
    ds.commit("Added initial data")

    # Create version tag
    ds.tag("v1.0")

    # View history
    for version in ds.history:
        print(version.id, version.message)

    # Create a new branch
    ds.branch("new-branch")

    # Add new data to the branch ...

    main_ds = ds.branches['main'].open()
    main_ds.merge("new-branch")

Async Operations
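The async calls in this section return an object whose `result()` method blocks until the work is done. The general pattern can be sketched with the standard library's `concurrent.futures`, used here purely as a stand-in. It is an assumption that Deep Lake's futures behave similarly, and `slow_load` is a hypothetical function, not a Deep Lake call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_load(start, stop):
    """Hypothetical stand-in for a slow read; not a Deep Lake call."""
    time.sleep(0.1)
    return list(range(start, stop))


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a Future immediately; the work runs in the background.
    first = pool.submit(slow_load, 0, 3)
    second = pool.submit(slow_load, 3, 6)

    # Other work can happen here while both loads are in flight.

    # result() blocks until the corresponding load has finished.
    items = first.result() + second.result()

print(items)  # [0, 1, 2, 3, 4, 5]
```

Starting both loads before waiting on either lets them overlap, which is where the speedup over sequential blocking calls comes from.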
Use async operations for better performance:

    # Async data loading
    future = ds["images"].get_async(slice(0, 1000))
    images = future.result()

    # Async query
    future = ds.query_async(
        "SELECT * WHERE labels = 'cat'"
    )
    cats = future.result()

Next Steps
- Explore RAG applications
- Check out Deep Learning integration
Support
If you encounter any issues:
- Check our GitHub Issues
- Join our Slack Community