Chrome Extension - Interactive Scraper

This is my attempt at writing an interactive scraper as a Chrome extension.


Look for the robot icon among the browser action icons.

How to use?

1. While browsing, you can pick a DOM element to scrape from time to time by clicking on "Capture detail".
2. The data is persisted in the IndexedDB of the background page.
3. The data can be viewed or deleted by clicking the "View" link in the add-in. This is implemented using DataTables (a table plug-in for jQuery).
4. Once configured, you can run the scraper by clicking the "Roll" button. It will take you through all the URLs, collect the text from each DOM path, and capture a screenshot for future reference.




Development Challenges: 

I faced a few challenges while developing this.

1. Chrome Synchronous Message Passing: I faced this problem when trying to load the URLs in a tab one after another. After a URL is loaded, the scraper should collect the text from the DOM and update it in the database. There are many answers available on the internet, but none of them solved my problem. The problem was

Port error: Could not establish connection. Receiving end does not exist. 

I am using chrome.tabs.onUpdated.addListener to detect when the new URL is loaded. Once the page is loaded, the scraper scrolls to the DOM element, highlights it, and extracts the text. The final step is to send the collected text back to the background page to store it in IndexedDB. But there was no easy way to control the order of these steps. I used chrome.tabs.executeScript to scroll to the DOM element and collect the text.
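
To make the scroll, highlight, and extract step concrete, here is a minimal sketch of how it could be done with chrome.tabs.executeScript. It is not the code from the extension: the helper name collectFromSelector, the use of a CSS selector as the DOM path, and the red outline are all assumptions made for illustration. The chrome object is passed in as chromeApi so the function can also be exercised outside the browser.

```javascript
// Sketch only: `chromeApi` stands in for the global `chrome` object, and
// `collectFromSelector` is a hypothetical helper name (not from the extension).
function collectFromSelector(chromeApi, tabId, selector, onText) {
  // Code injected into the page: find the element, scroll to it,
  // highlight it, and evaluate to its text (or null if it is missing).
  var code =
    "(function () {" +
    "  var el = document.querySelector(" + JSON.stringify(selector) + ");" +
    "  if (!el) { return null; }" +
    "  el.scrollIntoView();" +
    "  el.style.outline = '2px solid red';" +
    "  return el.textContent;" +
    "})();";

  chromeApi.tabs.executeScript(tabId, { code: code }, function (results) {
    // executeScript reports one result per frame; take the first one.
    onText(results && results.length ? results[0] : null);
  });
}
```

In the extension itself this would be called as collectFromSelector(chrome, tabId, "#some-element", callback), where the selector is whatever was captured earlier.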

I tried the following approaches. 

1. In the callback of chrome.tabs.executeScript, call chrome.tabs.sendMessage to request the content script to send the collected text back to the background page. It didn't work, as the content script's listener is not ready even though the page is loaded.

2. In chrome.tabs.onUpdated.addListener, once the page is loaded, call chrome.tabs.sendMessage in one of the callbacks to request the content script to send the collected text back to the background page.

3. Use the results of chrome.tabs.executeScript to capture the collected text. It didn't work, as the page was moving on to the next URL before the text was collected.
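
For reference, the failure in the first two approaches can at least be detected from the background page. When chrome.tabs.sendMessage cannot reach a listener in the tab, its callback is invoked with no arguments and chrome.runtime.lastError is set. The sketch below shows that check; the function name, the { action: "sendCollectedText" } message, and the { text: ... } reply shape are hypothetical and not taken from the extension.

```javascript
// Sketch only: `chromeApi` stands in for the global `chrome` object.
// The message and reply shapes are assumptions made for illustration.
function requestCollectedText(chromeApi, tabId, onText, onNoReceiver) {
  chromeApi.tabs.sendMessage(tabId, { action: "sendCollectedText" }, function (response) {
    // If no content script listener exists in the tab yet, the callback
    // runs without a response and runtime.lastError describes the problem.
    if (chromeApi.runtime.lastError) {
      onNoReceiver(chromeApi.runtime.lastError.message);
      return;
    }
    onText(response ? response.text : null);
  });
}
```

Detecting the error does not solve the timing problem by itself; it only tells the background page that the content script was not listening yet.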

Finally, I encountered the following post on Stack Overflow.  

http://stackoverflow.com/questions/8859622/chrome-extension-how-to-detect-that-content-script-is-already-loaded-into-a-tab

Using the above technique, we can initiate our actions on the loaded page only after the content script is loaded. I used the same variable to make this whole process behave synchronously.

1. Once the tab is loaded, use chrome.tabs.executeScript to execute the following script:

chrome.tabs.executeScript(tabid, {
    code: "var EnhanceLibIsLoaded=false;chrome.extension.sendMessage({ loaded: EnhanceLibIsLoaded || false });"
});

2. In the chrome.runtime.onMessage.addListener callback of the background page, you will receive request.loaded. This is the proof that the content script is loaded.

If request.loaded is false, collect the text, set EnhanceLibIsLoaded to true, and call chrome.tabs.update to move on to the next URL.

Yes, you can't do synchronous messaging in Chrome extensions. But using this approach, you can control the flow and achieve a synchronous flow of data.
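
The whole flow can be sketched roughly as follows. This is an illustration under assumptions, not the extension's actual code: rollThroughUrls, collectText, and onDone are hypothetical names, the chrome object is passed in as chromeApi so the loop can be run outside the browser, and a background-side flag (awaitingReply) is used in place of setting EnhanceLibIsLoaded in the page, to ignore duplicate messages.

```javascript
// The probe injected once a tab has finished loading (taken from the post).
var PROBE_CODE =
  "var EnhanceLibIsLoaded=false;" +
  "chrome.extension.sendMessage({ loaded: EnhanceLibIsLoaded || false });";

// Sketch only: visits `urls` one by one in the tab `tabId`.
// `collectText(tabId, callback)` is a hypothetical function that extracts
// the text from the page; `onDone(results)` runs after the last URL.
function rollThroughUrls(chromeApi, tabId, urls, collectText, onDone) {
  var index = 0;
  var results = [];
  var awaitingReply = false; // background-side guard (an assumption)

  if (urls.length === 0) { onDone(results); return; }

  // Step 1: when the tab finishes loading, inject the probe.
  function onUpdated(updatedTabId, changeInfo) {
    if (updatedTabId !== tabId || changeInfo.status !== "complete") { return; }
    awaitingReply = true;
    chromeApi.tabs.executeScript(tabId, { code: PROBE_CODE });
  }

  // Step 2: the probe's message shows the page can talk to us, so collect
  // the text and only then move the tab on to the next URL.
  function onMessage(request, sender) {
    var fromOurTab = sender && sender.tab && sender.tab.id === tabId;
    if (!fromOurTab || !awaitingReply || request.loaded !== false) { return; }
    awaitingReply = false;
    collectText(tabId, function (text) {
      results.push({ url: urls[index], text: text });
      index += 1;
      if (index < urls.length) {
        chromeApi.tabs.update(tabId, { url: urls[index] });
      } else {
        chromeApi.tabs.onUpdated.removeListener(onUpdated);
        chromeApi.runtime.onMessage.removeListener(onMessage);
        onDone(results);
      }
    });
  }

  chromeApi.tabs.onUpdated.addListener(onUpdated);
  chromeApi.runtime.onMessage.addListener(onMessage);
  chromeApi.tabs.update(tabId, { url: urls[0] });
}
```

The point of the design is that chrome.tabs.update for the next URL is only ever called from inside the collectText callback, so the tab cannot navigate away before the current page's text has been captured.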

2. Persisting Data in Chrome Extension Using IndexedDB:
The background page retains its contents only as long as the extension is active. So I used IndexedDB to store the data persistently in the Chrome extension. The tricky part of IndexedDB is keeping transactions active.

The following post is really helpful: http://stackoverflow.com/questions/10385364/how-do-you-keep-an-indexeddb-transaction-alive
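
As a rough illustration of the transaction issue: an IndexedDB transaction stays active only while requests are being placed on it, so doing other asynchronous work (a timer, a network call) in the middle of it makes later requests fail. One common way around this, sketched below, is to finish all other asynchronous work first and then issue every write in one go. The function name saveRecords is hypothetical, and the sketch assumes the object store was created with a keyPath or autoIncrement so that put needs no explicit key.

```javascript
// Sketch only: `db` is an already opened IDBDatabase, and `storeName` is an
// object store assumed to use a keyPath or autoIncrement.
function saveRecords(db, storeName, records, onDone, onError) {
  var tx = db.transaction([storeName], "readwrite");
  var store = tx.objectStore(storeName);

  // The writes are durable only once the transaction itself completes.
  tx.oncomplete = function () { onDone(records.length); };
  tx.onerror = function () { onError(tx.error); };

  // Queue every write right away, with no other asynchronous work in
  // between, so the transaction never goes inactive while we still need it.
  records.forEach(function (record) {
    store.put(record);
  });
}
```

With this shape, the scraper would collect the text first and call saveRecords only once the value to store is already in hand.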

3. DataTables: Please see the code for more details (dTable.cs).

Source code can be found at https://github.com/pradeepkumargali/Interactive-Scraper 
