This article takes a look at the BilibiliDocumentReader in Spring AI Alibaba.

BilibiliDocumentReader

community/document-readers/spring-ai-alibaba-starter-document-reader-bilibili/src/main/java/com/alibaba/cloud/ai/reader/bilibili/BilibiliDocumentReader.java

public class BilibiliDocumentReader implements DocumentReader {

    private static final Logger logger = LoggerFactory.getLogger(BilibiliDocumentReader.class);

    private static final String API_BASE_URL = "https://api.bilibili.com/x/web-interface/view?bvid=";

    private final String resourcePath;

    private final ObjectMapper objectMapper;

    private static final int MEMORY_SIZE = 5;

    private static final int BYTE_SIZE = 1024;

    private static final int MAX_MEMORY_SIZE = MEMORY_SIZE * BYTE_SIZE * BYTE_SIZE;

    private static final WebClient WEB_CLIENT = WebClient.builder()
        .defaultHeader(HttpHeaders.ACCEPT, MediaType.APPLICATION_JSON_VALUE)
        .codecs(configurer -> configurer.defaultCodecs().maxInMemorySize(MAX_MEMORY_SIZE))
        .build();

    public BilibiliDocumentReader(String resourcePath) {
        Assert.hasText(resourcePath, "Query string must not be empty");
        this.resourcePath = resourcePath;
        this.objectMapper = new ObjectMapper();
    }

    @Override
    public List<Document> get() {
        List<Document> documents = new ArrayList<>();
        try {
            String bvid = extractBvid(resourcePath);
            String videoInfoResponse = fetchVideoInfo(bvid);
            JsonNode videoData = parseJson(videoInfoResponse).path("data");
            String title = videoData.path("title").asText();
            String description = videoData.path("desc").asText();
            Document infoDoc = new Document("Video information",
                    Map.of("title", title, "description", description));
            documents.add(infoDoc);
            String documentContent = fetchAndProcessSubtitles(videoData, title, description);
            documents.add(new Document(documentContent));
        }
        catch (IllegalArgumentException e) {
            logger.error("Invalid input: {}", e.getMessage());
            documents.add(new Document("Error: Invalid input"));
        }
        catch (IOException e) {
            logger.error("Error parsing JSON: {}", e.getMessage(), e);
            documents.add(new Document("Error parsing JSON: " + e.getMessage()));
        }
        catch (Exception e) {
            logger.error("Unexpected error: {}", e.getMessage(), e);
            documents.add(new Document("Unexpected error: " + e.getMessage()));
        }
        return documents;
    }

    private String extractBvid(String resourcePath) {
        return resourcePath.replaceAll(".*(BV\\w+).*", "$1");
    }

    private String fetchVideoInfo(String bvid) {
        return WEB_CLIENT.get().uri(API_BASE_URL + bvid).retrieve().bodyToMono(String.class).block();
    }

    private JsonNode parseJson(String jsonResponse) throws IOException {
        return objectMapper.readTree(jsonResponse);
    }

    private String fetchAndProcessSubtitles(JsonNode videoData, String title, String description)
            throws IOException {
        JsonNode subtitleList = videoData.path("subtitle").path("list");
        if (subtitleList.isArray() && subtitleList.size() > 0) {
            String subtitleUrl = subtitleList.get(0).path("subtitle_url").asText();
            String subtitleResponse = WEB_CLIENT.get().uri(subtitleUrl).retrieve().bodyToMono(String.class).block();
            JsonNode subtitleJson = parseJson(subtitleResponse);
            StringBuilder rawTranscript = new StringBuilder();
            subtitleJson.path("body").forEach(node -> rawTranscript.append(node.path("content").asText()).append(" "));
            return String.format("Video Title: %s, Description: %s\nTranscript: %s", title, description,
                    rawTranscript.toString().trim());
        }
        else {
            return String.format("No subtitles found for video: %s. Returning an empty transcript.", resourcePath);
        }
    }

}
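BilibiliDocumentReader uses WebClient to call the Bilibili API. It extracts the bvid from the URL, requests the video-info endpoint with that bvid, and parses the JSON response to obtain the title and description, which are stored as metadata on a first document. fetchAndProcessSubtitles then requests the subtitle_url of the first entry in the subtitle list and joins the subtitle text into the content of a second document. Exceptions are not propagated: get() logs them and adds a document that carries the error message instead.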
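To make the bvid step concrete, here is a minimal plain-Java sketch that reuses the same regular expression as extractBvid. The class name BvidExtractionSketch and the stricter helper findBvid are hypothetical names introduced only for illustration; they are not part of Spring AI Alibaba.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: BvidExtractionSketch and findBvid are hypothetical names.
class BvidExtractionSketch {

    // Hypothetical pattern used by the stricter variant below.
    private static final Pattern BVID = Pattern.compile("BV\\w+");

    // Same expression as the reader's extractBvid: the whole input is replaced by group 1.
    static String extractBvid(String resourcePath) {
        return resourcePath.replaceAll(".*(BV\\w+).*", "$1");
    }

    // Hypothetical stricter variant: returns an empty Optional when no BV id is present.
    static Optional<String> findBvid(String resourcePath) {
        Matcher matcher = BVID.matcher(resourcePath);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "https://www.bilibili.com/video/BV1KMwgeKECx/?t=7&vd_source=3069f51b168ac07a9e3c4ba94ae26af5";
        System.out.println(extractBvid(url));           // prints BV1KMwgeKECx
        System.out.println(extractBvid("not-a-video")); // no match, so the input comes back unchanged
        System.out.println(findBvid("not-a-video"));    // prints Optional.empty
    }
}
```

Because String.replaceAll returns the input unchanged when nothing matches, a string without a BV id is passed on as the bvid as-is. The Optional-based variant is one possible way to detect that case up front, not something the library does.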

Example

public class BilibiliDocumentReaderTest {

    private static final Logger logger = LoggerFactory.getLogger(BilibiliDocumentReader.class);

    @Test
    void bilibiliDocumentReaderTest() {
        BilibiliDocumentReader bilibiliDocumentReader = new BilibiliDocumentReader(
                "https://www.bilibili.com/video/BV1KMwgeKECx/?t=7&vd_source=3069f51b168ac07a9e3c4ba94ae26af5");
        List<Document> documents = bilibiliDocumentReader.get();
        logger.info("documents: {}", documents);
    }

}

Summary

spring-ai-alibaba-starter-document-reader-bilibili provides BilibiliDocumentReader for reading a Bilibili video URL. It makes two HTTP requests: one to fetch the title and description, and one to fetch the subtitles.
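The content that results from the second request can be sketched without any HTTP or JSON dependency. The snippet below mirrors the string assembly in fetchAndProcessSubtitles, assuming the subtitle "content" values have already been extracted into a list. TranscriptSketch and buildContent are hypothetical names, and an empty list stands in for the case where the API returns no subtitle track.

```java
import java.util.List;

// Illustrative sketch only: TranscriptSketch and buildContent are hypothetical names.
class TranscriptSketch {

    // Assumption: subtitleLines holds the "content" value of each subtitle entry;
    // an empty list represents a video without any subtitle track.
    static String buildContent(String resourcePath, String title, String description,
            List<String> subtitleLines) {
        if (subtitleLines.isEmpty()) {
            return String.format("No subtitles found for video: %s. Returning an empty transcript.",
                    resourcePath);
        }
        // Append each line followed by a space, then trim, as the reader does.
        StringBuilder rawTranscript = new StringBuilder();
        for (String line : subtitleLines) {
            rawTranscript.append(line).append(" ");
        }
        return String.format("Video Title: %s, Description: %s\nTranscript: %s", title, description,
                rawTranscript.toString().trim());
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to show the shape of the output.
        System.out.println(buildContent("BV1KMwgeKECx", "Sample title", "Sample description",
                List.of("first line", "second line")));
        System.out.println(buildContent("BV1KMwgeKECx", "Sample title", "Sample description", List.of()));
    }
}
```

In both branches the reader still returns a document, so a caller that needs to tell a real transcript from the fallback has to inspect the text itself.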

codecraft