One day I needed a solution that could parse Open Graph meta tags from an input line and produce a title and an icon.
Of course, there were countless libraries built on jsoup, but that was not what I needed: I wanted to use Qt and C++.
I thought that as soon as I entered my query, "c++ parser meta tags", I would see all the solutions I was looking for.
But in reality, everything is a little more complicated.
What I did:
1) Preparation step: parse the input and decide whether it contains a valid URL or is just text.
This step seems expensive (not as expensive as loading everything from the input,
but I think it is still extra work).
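The `web_pattern` string itself is not shown in this post. As a rough sketch of what such a check could look like, here is a plain C++ version that uses `std::regex` instead of `QRegularExpression`, so it compiles without Qt. The names `kWebPattern` and `containsHyperlink` and the pattern itself are assumptions for illustration, not the ones used in the project.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for web_pattern: an http(s) scheme followed by
// at least one non-whitespace character. Deliberately loose.
static const std::regex kWebPattern(R"(https?://\S+)", std::regex::icase);

// Returns true if the line contains something that looks like a hyperlink.
static bool containsHyperlink(const std::string& line) {
    return std::regex_search(line, kWebPattern);
}
```

Building the regex once as a static object, as the Qt version does, avoids recompiling the pattern on every keystroke, which is where most of the cost of this step would otherwise come from.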

    static bool checkIsContainsHyperlink(QString line) {
        static QRegularExpression regex(web_pattern);
        QRegularExpressionMatch match = regex.match(line);
        return match.hasMatch();
    }

2) Download with the ability to handle redirects
Many sites do not provide the tags on the page you request directly and often redirect, for reasons I don't know.

    connect(&m_WebCtrl, SIGNAL(finished(QNetworkReply*)), this, SLOT(fileDownloaded(QNetworkReply*)));
    QNetworkRequest request(url);
    // RedirectPolicyAttribute expects a QNetworkRequest::RedirectPolicy value, not a bool
    request.setAttribute(QNetworkRequest::RedirectPolicyAttribute, QNetworkRequest::NoLessSafeRedirectPolicy);
    m_WebCtrl.get(request);

3) Saving the page we downloaded
Saving the page may seem strange, and you are probably wondering why we do it.
The problem is that some sites ban an IP address that makes a lot of requests.
For me, it was enough to change 3-5 characters in the URL line, and I got banned for a few minutes.
Caching downloaded pages solved this problem.
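The caching idea can be sketched in plain C++17 roughly like this. It is only an illustration under my own assumptions: the helper names (`cachePathFor`, `loadCached`, `storeCached`), the `.html` suffix, and the use of `std::hash` for file names are all hypothetical, and the real project uses Qt classes instead.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: map a URL to a file inside cacheDir.
// std::hash is not guaranteed to be stable across builds or platforms,
// so this only suits a throwaway local cache.
fs::path cachePathFor(const fs::path& cacheDir, const std::string& url) {
    return cacheDir / (std::to_string(std::hash<std::string>{}(url)) + ".html");
}

// Returns the cached page for this URL, or std::nullopt if there is none.
std::optional<std::string> loadCached(const fs::path& cacheDir, const std::string& url) {
    std::ifstream in(cachePathFor(cacheDir, url), std::ios::binary);
    if (!in) return std::nullopt;
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Stores a downloaded page; returns false if the file could not be written.
bool storeCached(const fs::path& cacheDir, const std::string& url, const std::string& html) {
    std::error_code ec;
    fs::create_directories(cacheDir, ec);
    if (ec) return false;
    std::ofstream out(cachePathFor(cacheDir, url), std::ios::binary | std::ios::trunc);
    out << html;
    out.flush();
    return static_cast<bool>(out);
}
```

The intended flow is to call `loadCached` first and only hit the network when it returns nothing, then call `storeCached` with the response, so retyping the same URL does not trigger a second request.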

    connect(m_downloader_image, &FileDownloader::downloaded, [&, imagePathName]() {
        QByteArray array = m_downloader_image->downloadedData();
        if (!array.isEmpty()) {
            QFile imageFile(imagePathName);
            if (imageFile.open(QIODevice::WriteOnly)) {
                imageFile.write(array);
                m_result.og_image_local_path = imagePathName;
            }
        }
        emit signalParserDone(m_result);
    });

4) Parsing
So we have the web page in a local folder; it's time to parse it and get what we need.
Unfortunately, for me, gumbo-parser turned out to be very unfriendly.
So, as a first step, I decided to use regex, hoping to replace it with something else in the future.
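The pattern strings (`og_site_name`, `og_title`, and so on) are not listed in this post. As a sketch of what one of them could look like, here is a `std::regex` version that builds the pattern from the tag name. The function name `ogContent` is hypothetical, and the pattern assumes `property` comes before `content` and that both use double quotes, which real pages do not always respect.

```cpp
#include <optional>
#include <regex>
#include <string>

// Extracts the content attribute of <meta property="og:NAME" content="...">.
// Assumptions: property precedes content, both are double-quoted, and name
// contains no regex metacharacters (e.g. "title", "site_name", "image").
std::optional<std::string> ogContent(const std::string& html, const std::string& name) {
    const std::regex re(
        "<meta[^>]*property=\"og:" + name + "\"[^>]*content=\"([^\"]*)\"",
        std::regex::icase);
    std::smatch m;
    if (std::regex_search(html, m, re)) {
        return m[1].str();  // first capture group: the content value
    }
    return std::nullopt;
}
```

Because `[^>]*` cannot cross the end of a tag, a match never mixes the `property` of one meta tag with the `content` of another. The same pattern string should also work with `QRegularExpression` and `captured(1)`.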

    QRegularExpression site_name_regex(og_site_name);
    QRegularExpression title_regex(og_title);
    QRegularExpression description_regex(og_description);
    QRegularExpression url_regex(og_url);
    QRegularExpression image_regex(og_image);

    QRegularExpressionMatch match;

    match = site_name_regex.match(html);
    if (match.hasMatch()) {
        res.og_site_name = match.captured(1);
    }

    match = title_regex.match(html);
    if (match.hasMatch()) {
        res.og_title = match.captured(1);
    }

    match = description_regex.match(html);
    if (match.hasMatch()) {
        res.og_description = match.captured(1);
    }

    match = url_regex.match(html);
    if (match.hasMatch()) {
        res.og_url = match.captured(1);
    }

    match = image_regex.match(html);
    if (match.hasMatch()) {
        res.og_image = match.captured(1);
    }

Finally, we can enter a URL and enjoy the preview and title.
