Transform unstructured research papers into structured data with the power of OpenAI's GPT-4o-mini and Pydantic models
This repository accompanies our comprehensive YouTube tutorial on "Extracting Structured Data from Research Papers using OpenAI's Structured Output API".
🎥 Watch the full tutorial here | ⭐ Don't forget to subscribe!
Ever struggled with extracting meaningful data from academic papers? This project demonstrates how to:
- ✅ Parse unstructured research papers (Markdown format)
- ✅ Extract structured metadata using OpenAI's latest structured output API
- ✅ Validate data with Pydantic models
- ✅ Export results to CSV format for further analysis
- ✅ Handle complex nested data structures
```
📁 OpenAI_API_structured_output/
├── 📄 main.py                          # Main extraction script
├── 📄 paper.md                         # Sample research paper
├── 📄 research_extraction_papers.csv   # Generated output
├── 📄 .env                             # Environment variables
├── 📄 requirements.txt                 # Dependencies
└── 📄 README.md                        # This file
```

- Python 3.8 or higher
- OpenAI API key (Get yours here)
```bash
git clone https://github.com/yourusername/OpenAI_API_structured_output.git
cd OpenAI_API_structured_output
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```
OPENAI_API_KEY="your_actual_api_key_here"
```

Then run the extraction script:

```bash
python main.py
```

Our Pydantic models ensure type safety and data validation:
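For reference, the `.env` parsing that a loader like `python-dotenv` performs can be sketched with the standard library alone (a simplified illustration only; the real `load_dotenv()` handles quoting, comments, and interpolation more thoroughly, and this repo's actual loading code may differ):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines become environment variables.
    Variables already set in the environment are not overwritten."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Load only if a .env file is actually present
if Path(".env").exists():
    load_env()
```

After this runs, `os.environ["OPENAI_API_KEY"]` holds the key for the OpenAI client to pick up.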
```python
class Abstract(BaseModel):
    summary: str = Field(description="3 lines summary of the paper.")
    keywords: list[str] = Field(description="list of keywords")
    conclusion: str = Field(description="Conclusion made by the authors, in 2 lines.")

class ResearchPaper(BaseModel):
    title: str = Field(description="Title of the paper")
    authors: list[str] = Field(description="List of the authors in the paper")
    publication_year: str = Field(description="The year of publication")
    research_field: str = Field(description="The research field of the paper. In one word")
    # ... and more fields
```

| Feature | Description | Status |
|---|---|---|
| Structured Output API | Uses OpenAI's latest JSON schema feature | ✅ |
| Type Validation | Pydantic models ensure data integrity | ✅ |
| CSV Export | Automatic export to structured format | ✅ |
| Error Handling | Robust error handling for API calls | ✅ |
| Confidence Scoring | AI-generated confidence scores (0-1) | ✅ |
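Under the hood, the Structured Output API constrains the model's reply to the JSON schema derived from the Pydantic model. A minimal sketch of that connection (assuming Pydantic v2; the commented-out call uses `client.beta.chat.completions.parse`, the structured-output helper in recent `openai` Python releases, and requires a live API key — the repo's own `structured_request` function may wrap it differently):

```python
from pydantic import BaseModel, Field

class Abstract(BaseModel):
    summary: str = Field(description="3 lines summary of the paper.")
    keywords: list[str] = Field(description="list of keywords")
    conclusion: str = Field(description="Conclusion made by the authors, in 2 lines.")

class ResearchPaper(BaseModel):
    title: str
    authors: list[str]
    publication_year: str
    research_field: str
    abstract: Abstract

# This JSON schema is what the API enforces on the model's reply
schema = ResearchPaper.model_json_schema()
print(sorted(schema["properties"]))
# → ['abstract', 'authors', 'publication_year', 'research_field', 'title']

# With the openai package installed and OPENAI_API_KEY set:
# completion = client.beta.chat.completions.parse(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": paper_text}],
#     response_format=ResearchPaper,
# )
# paper = completion.choices[0].message.parsed  # a validated ResearchPaper
```

Because the schema is generated from the model, adding a field to the class automatically extends what the API extracts.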
The script processes research papers and extracts:
- 📄 Title & Authors
- 📅 Publication Details (Year, Journal, DOI)
- 🔬 Research Field classification
- 🎯 Key Findings (bulleted list)
- 💻 Code Availability (links if present)
- 📊 Confidence Score (AI-generated)
- 📝 Structured Abstract (summary, keywords, conclusion)
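Once validated, each `ResearchPaper` instance can be flattened to a row of the output CSV. A sketch of how that export step might look (assuming `pandas`, which the acknowledgments mention; the repo's actual column layout and field set may differ):

```python
import pandas as pd
from pydantic import BaseModel

class ResearchPaper(BaseModel):
    title: str
    authors: list[str]
    publication_year: str
    research_field: str
    confidence_score: float

paper = ResearchPaper(
    title="Quantum-Enhanced Machine Learning for Real-Time Financial Risk Assessment",
    authors=["Dr. Alexandra Kim", "Prof. James Rodriguez", "Dr. Sarah Chen"],
    publication_year="2024",
    research_field="Finance",
    confidence_score=0.95,
)

# model_dump() turns the validated model into a plain dict;
# join list fields so each paper occupies a single CSV row
row = paper.model_dump()
row["authors"] = "; ".join(row["authors"])

df = pd.DataFrame([row])
df.to_csv("research_extraction_papers.csv", index=False)
```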
```
Response title: Quantum-Enhanced Machine Learning for Real-Time Financial Risk Assessment
================================================================================
Response authors: ['Dr. Alexandra Kim, MIT Computer Science Department', 'Prof. James Rodriguez, Stanford Financial Engineering', 'Dr. Sarah Chen, Google DeepMind Research']
================================================================================
Response publication_year: 2024
================================================================================
Response research_field: Finance
================================================================================
Response confidence_score: 0.95
```

- Extend the `ResearchPaper` model:

```python
class ResearchPaper(BaseModel):
    # existing fields...
    new_field: str = Field(description="Your new field description")
```

- The OpenAI API will automatically extract the new field based on your description.
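Because the result is validated by Pydantic, a malformed API reply fails loudly rather than producing silently incomplete rows. A small illustration using the hypothetical `new_field` from above (assuming Pydantic v2):

```python
from pydantic import BaseModel, Field, ValidationError

class ResearchPaper(BaseModel):
    title: str = Field(description="Title of the paper")
    new_field: str = Field(description="Your new field description")

# A well-formed reply parses into a typed object
paper = ResearchPaper.model_validate({"title": "Example", "new_field": "value"})

# A reply missing the new field raises ValidationError instead of
# slipping through with incomplete data
try:
    ResearchPaper.model_validate({"title": "Example"})
    missing = 0
except ValidationError as exc:
    missing = exc.error_count()  # one missing-field error
```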
Replace `"gpt-4o-mini"` in the `structured_request` function:
```python
completion = client.chat.completions.create(
    model="gpt-4o",  # or any other supported model
    # ...
)
```

This project is perfect for:
- 🎓 Academic Research - Systematically catalog research papers
- 🏢 Corporate R&D - Track industry research trends
- 📊 Data Analysis - Build research databases
- 🤖 AI Training - Generate structured datasets
- 📚 Literature Reviews - Automate paper summarization
⚠️ API Costs: This project uses OpenAI's paid API. Monitor your usage to avoid unexpected charges.
🔒 Security: Never commit your actual API key to version control. Always use environment variables.
📏 Token Limits: Large papers may exceed token limits. Consider splitting very long documents.
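One pragmatic way to split a long paper is into overlapping chunks, extracting from each piece separately (a rough standard-library sketch, not something this repo ships; the overlap keeps sentences cut at a chunk boundary visible in both neighbors):

```python
def split_document(text: str, max_chars: int = 12_000, overlap: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars characters, with
    consecutive chunks overlapping by `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks

doc = "x" * 30_000
parts = split_document(doc)
print(len(parts), [len(p) for p in parts])  # 3 chunks: 12000, 12000, 7000 chars
```

Character counts are only a proxy for tokens (roughly 4 characters per token for English text is a common rule of thumb), so leave headroom below the model's context limit.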
Found a bug or have a suggestion? We'd love your input!
- 🍴 Fork the repository
- 🌿 Create a feature branch (`git checkout -b feature/AmazingFeature`)
- 💾 Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- 📤 Push to the branch (`git push origin feature/AmazingFeature`)
- 🔃 Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- 🤖 OpenAI for the incredible structured output API
- 🛡️ Pydantic team for fantastic data validation
- 🐼 Pandas for seamless data manipulation
- 🎥 Our YouTube subscribers for the amazing support!
🌟 If this project helped you, please give it a star! 🌟
📺 Subscribe to our YouTube channel for more AI tutorials!