Skip to content

YouvenZ/Structured_output_openai_Research

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ OpenAI Structured Output Extractor

Transform unstructured research papers into structured data with the power of OpenAI's GPT-4o-mini and Pydantic models

Python 3.8+ OpenAI API Pydantic License: MIT

πŸ“Ί YouTube Tutorial

This repository accompanies our comprehensive YouTube tutorial on "Extracting Structured Data from Research Papers using OpenAI's Structured Output API".

πŸŽ₯ Watch the full tutorial here | ⭐ Don't forget to subscribe!


🌟 What This Project Does

Ever struggled with extracting meaningful data from academic papers? This project demonstrates how to:

  • βœ… Parse unstructured research papers (Markdown format)
  • βœ… Extract structured metadata using OpenAI's latest structured output API
  • βœ… Validate data with Pydantic models
  • βœ… Export results to CSV format for further analysis
  • βœ… Handle complex nested data structures

πŸ—οΈ Project Structure

πŸ“ OpenAI_API_structured_output/ β”œβ”€β”€ πŸ“„ main.py # Main extraction script β”œβ”€β”€ πŸ“„ paper.md # Sample research paper β”œβ”€β”€ πŸ“„ research_extraction_papers.csv # Generated output β”œβ”€β”€ πŸ“„ .env # Environment variables β”œβ”€β”€ πŸ“„ requirements.txt # Dependencies └── πŸ“„ README.md # This file 

🚦 Quick Start

Prerequisites

1️⃣ Clone the Repository

git clone https://github.com/yourusername/OpenAI_API_structured_output.git cd OpenAI_API_structured_output

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Configure Environment

Create a .env file in the project root:

OPENAI_API_KEY="your_actual_api_key_here"

4️⃣ Run the Extraction

python main.py

πŸ”§ Core Components

πŸ“‹ Data Models

Our Pydantic models ensure type safety and data validation:

class Abstract(BaseModel): summary: str = Field(description="3 lines summary of the paper.") keywords: list[str] = Field(description="list of keywords") conclusion: str = Field(description="Conclusion made by the authors, in 2 lines.") class ResearchPaper(BaseModel): title: str = Field(description="Title of the papers") authors: list[str] = Field(description="List of the authors in the paper") publication_year: str = Field(description="The year of publication") research_field: str = Field(description="The research field of the paper. In one word") # ... and more fields

πŸ€– Key Features

Feature Description Status
Structured Output API Uses OpenAI's latest JSON schema feature βœ…
Type Validation Pydantic models ensure data integrity βœ…
CSV Export Automatic export to structured format βœ…
Error Handling Robust error handling for API calls βœ…
Confidence Scoring AI-generated confidence scores (0-1) βœ…

πŸ“Š Sample Output

The script processes research papers and extracts:

  • πŸ“ Title & Authors
  • πŸ“… Publication Details (Year, Journal, DOI)
  • πŸ”¬ Research Field classification
  • 🎯 Key Findings (bulleted list)
  • πŸ”— Code Availability (links if present)
  • πŸ“ˆ Confidence Score (AI-generated)
  • πŸ“‹ Structured Abstract (summary, keywords, conclusion)

Example Console Output:

Response title: Quantum-Enhanced Machine Learning for Real-Time Financial Risk Assessment ================================================================================ Response authors: ['Dr. Alexandra Kim, MIT Computer Science Department', 'Prof. James Rodriguez, Stanford Financial Engineering', 'Dr. Sarah Chen, Google DeepMind Research'] ================================================================================ Response publication_year: 2024 ================================================================================ Response research_field: Finance ================================================================================ Response confidence_score: 0.95 

πŸ› οΈ Customization

Adding New Fields

  1. Extend the ResearchPaper model:
class ResearchPaper(BaseModel): # existing fields... new_field: str = Field(description="Your new field description")
  1. The OpenAI API will automatically extract the new field based on your description.

Changing Models

Replace "gpt-4o-mini" in the structured_request function:

completion = client.chat.completions.create( model="gpt-4o", # or any other supported model # ... )

🎯 Use Cases

This project is perfect for:

  • πŸ“š Academic Research - Systematically catalog research papers
  • 🏒 Corporate R&D - Track industry research trends
  • πŸ“Š Data Analysis - Build research databases
  • πŸ€– AI Training - Generate structured datasets
  • πŸ“– Literature Reviews - Automate paper summarization

🚨 Important Notes

⚠️ API Costs: This project uses OpenAI's paid API. Monitor your usage to avoid unexpected charges.

πŸ”’ Security: Never commit your actual API key to version control. Always use environment variables.

πŸ“ Token Limits: Large papers may exceed token limits. Consider splitting very long documents.

🀝 Contributing

Found a bug or have a suggestion? We'd love your input!

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch (git checkout -b feature/AmazingFeature)
  3. πŸ’Ύ Commit your changes (git commit -m 'Add some AmazingFeature')
  4. πŸ“€ Push to the branch (git push origin feature/AmazingFeature)
  5. πŸ”„ Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • πŸ€– OpenAI for the incredible structured output API
  • 🐍 Pydantic team for fantastic data validation
  • πŸ“Š Pandas for seamless data manipulation
  • πŸŽ₯ Our YouTube subscribers for the amazing support!

🌟 If this project helped you, please give it a star! 🌟

πŸ“Ί Subscribe to our YouTube channel for more AI tutorials!


About

Tutorial for extractiong information from research paper in systematic way using the chatgpt openAI API.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages