Chrome Extension - Interactive Scraper

July 17, 2013

This is my attempt to write an Interactive Scraper.

Look for the robot icon among the browser action icons.

How to use?

1. From the browser, choose the DOM elements to scrape from time to time by clicking on "Capture detail".
2. Data is persisted in the IndexedDB of the background page.
3. Data can be viewed or deleted by clicking the "View" link in the add-in. This is implemented using DataTables (a table plug-in for jQuery).
4. Once configured, you can run the scraper by clicking the "Roll" button. It takes you through all the URLs, collects the text from each DOM path and captures a screenshot for your future reference.

Development Challenges: 

I have faced a few development challenges while developing this. 

1. Chrome Synchronous Message Passing: I faced this problem when trying to load the URLs in a tab one after another. After a URL is loaded, the scraper should collect the text from the DOM and update it in the database. There are many answers available on the internet, but none of them solved my problem. The error was

Port error: Could not establish connection. Receiving end does not exist. 

I am using chrome.tabs.onUpdated.addListener to detect when the new URL has loaded. Once the page is loaded, the scraper scrolls to the DOM element, highlights it and extracts the text. The final step is to send the collected text back to the background page to store it in IndexedDB. But there was no easy way to control this sequence. I used chrome.tabs.executeScript to scroll to the DOM element and collect the text.

I tried the following approaches. 

1. In the callback of chrome.tabs.executeScript, call chrome.tabs.sendMessage to request the content script to send the collected text back to the background page. It didn't work, as the content script's listener is not ready even though the page is loaded.

2. In chrome.tabs.onUpdated.addListener, once the page is loaded, call chrome.tabs.sendMessage in one of the callbacks to request the content script to send the collected text back to the background page.

3. Use the results of chrome.tabs.executeScript to capture the collected text. It didn't work, as the page moves on to the next URL before the text is collected.

Finally, I encountered the following post on Stack Overflow.

Using the above technique, we can initiate our actions on the loaded page only after the content script is loaded. I used the same variable to make the whole process synchronous.

1. Once the tab is loaded, use chrome.tabs.executeScript to execute the following script:

chrome.tabs.executeScript(tabid, {
    code: "var EnhanceLibIsLoaded=false;chrome.extension.sendMessage({ loaded: EnhanceLibIsLoaded || false });"
});

2. In the chrome.runtime.onMessage.addListener handler of the background page, you will receive request.loaded. This is the proof that the content script is loaded.

If request.loaded is false, collect the text, set EnhanceLibIsLoaded to true and call chrome.tabs.update.

Yes, you can't do synchronous messaging in Chrome extensions. But using this approach you can control the flow and achieve a synchronous flow of data.
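
The flow described above boils down to "do not move to the next URL until the current page has reported in and its text has been collected". Here is a minimal sketch of that idea in plain JavaScript, outside of any extension. The names runSequentially, loadPage and collectText are made up for illustration and are not part of the extension's code. In the extension, loadPage would presumably wrap chrome.tabs.update and resolve once the background page receives the loaded message, and collectText would wrap the executeScript step.

```javascript
// Hypothetical helper: visit URLs strictly one at a time.
// loadPage(url) must resolve only when the page has signalled it is ready;
// collectText(page) must resolve with the text scraped from that page.
async function runSequentially(urls, loadPage, collectText) {
  const results = [];
  for (const url of urls) {
    // Wait for the "ready" signal before touching the page.
    const page = await loadPage(url);
    // Wait for the text before moving on, so no page is left half-read.
    const text = await collectText(page);
    results.push({ url: url, text: text });
  }
  return results;
}

// Usage with stand-in functions (no browser needed):
runSequentially(
  ["https://example.com/a", "https://example.com/b"],
  async (url) => ({ url: url }),
  async (page) => "text from " + page.url
).then((rows) => console.log(rows));
```

The loop is deliberately not parallel: each iteration awaits both steps, which mirrors the "synchronous" behaviour the handshake variable was used to get.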

2. Persisting Data in a Chrome Extension Using IndexedDB:
The background page retains its contents only as long as the extension is active. So, I used IndexedDB to store the data persistently in the Chrome extension. The tricky part of IndexedDB is keeping transactions active.

Following post is really helpful.

3. Data Tables:  Please see the code for more details (dTable.cs). 

Source code can be found at 


Native Vs Hybrid

July 09, 2013

There are three ways to deliver an application on mobile devices:
1. Native – Built on Native Mobile Platform
2. HTML 5 – Works in browser
3. Hybrid – Built with HTML5, Can access native features
It is clear that users spend 80% of their time on mobile using apps. That is four times the time they spend browsing. Hence, a browser-based HTML5 application is not the preferred way to deliver your mobile application. That leaves two options. One, build a native application on the mobile OS platform (e.g. Android or iOS). Two, use your HTML5 skills to quickly build an application and make it work as a native app on all platforms using build frameworks such as PhoneGap.
Many have tried to compare and contrast Native and Hybrid. In this article I summarize my own analysis of the subject. After going through a lot of articles and presentations, a presentation from Propelics appealed to me. It is unbiased and to the point. Below are the pros and cons for each approach.
HTML5:
Cross platform deployment cost
Update Speed
Pixel Perfect UX
Availability of Skill Sets
APIs to Platform Specific Features
Support Multiple Platforms
Extensive Offline Support
Quicker Learning Curve
App Monetization
No App Store Approvals Needed
Inconsistent Browser Support

Hybrid:
Mix Web Code with Native Wrapper
Performance – Browser Dependent
Less Code to Support Multi-Platform
Requires More Specialized Dev. Skills
App Store Experience
Native Apps that Don’t Look Native
APIs to Access Device Features
Risk of App Store Rejection
Make Changes w/o Resubmit to App Store
Extensive Offline Support

Inconsistent Browser Support

Native:
Rich User Interface
More difficult to Support / Maintain
Best App Performance
Increased Time to Update/ Distribute
Most Secure
Skills Can be Hard to Find
Hooks into all the device APIs
Limited App Portability

The general verdict on Hybrid vs. Native is “Let your business requirement drive the decision”. I completely agree with this conclusion. But some of the above points are not so evident, and I would like to elaborate on them based on my research and experience.
One misconception about the Hybrid platform is “Write Once – Deploy Everywhere”. That is not completely true. On a Hybrid platform, you develop an HTML5 app (using CSS and JavaScript). The website is then wrapped with a native API by a hybrid mobile development platform (such as PhoneGap) for a specific device. The UI of the app is nothing but a web browser instance (for Apple devices, it’s Safari) with 100% height and width, and JavaScript is used to access device-specific features.
An important point to note here is that, in Hybrid, we are trying to make a website look like a native application with the help of a third-party framework. Thus, we can carry the Model and Controller parts of the code to other devices, but the View should be specific to each device. If developing an application for Android, the web page should be able to mimic native UI features such as animations, scroll bounce, hidden scrollbars, buttons, lists etc. (see the case study of Exfm). I think this is one of the reasons why most PhoneGap applications are released on one or two platforms only.
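To make the split concrete, here is a small sketch of a Model and Controller that are shared, with only the View swapped per platform. All names (TaskModel, TaskController, androidView, iosView) are invented for this illustration, and the views return plain strings instead of touching the DOM so the example stays self-contained. A real hybrid app would render platform-styled markup instead.

```javascript
// Shared across platforms: the model holds data and knows nothing about the UI.
class TaskModel {
  constructor() {
    this.tasks = [];
  }
  add(title) {
    this.tasks.push({ title: title, done: false });
  }
}

// Shared across platforms: the controller works with whichever view it is given.
class TaskController {
  constructor(model, view) {
    this.model = model;
    this.view = view;
  }
  addTask(title) {
    this.model.add(title);
    return this.view.render(this.model.tasks);
  }
}

// Platform-specific: each view decides how the same data is presented.
const androidView = {
  render: (tasks) => tasks.map((t) => "[ ] " + t.title).join("\n"),
};
const iosView = {
  render: (tasks) => tasks.map((t) => "○ " + t.title).join("\n"),
};

// Same model and controller code, different look per platform.
console.log(new TaskController(new TaskModel(), androidView).addTask("Buy milk"));
console.log(new TaskController(new TaskModel(), iosView).addTask("Buy milk"));
```

Only the two view objects would need to be rewritten for a new platform, which is the part of "Write Once" that does carry over.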
We have one web page, and the DOM of the page is manipulated using JavaScript. Of course, we have frameworks to manage this part.
Bootstrap.js – Twitter JS library, for fluid UI across platforms. It doesn’t quite work for complex applications.
jQuery Mobile – Great DOM manipulation library and easy to learn. It works great for simple demos, but performance of animations is an issue. Recommended for simple projects.
Sencha Touch – Based on the Ext JS framework. Good animations. Learning takes some effort. It’s recommended for real projects.
Lungo -
Controller and Model
Backbone.js – Gives structure to web applications by providing models with key-value binding and custom events, collections with a rich API of enumerable functions, views with declarative event handling, and connects it all to your existing API over a RESTful JSON interface.
Angular.js – MVC framework by Google
Ember.js – A JavaScript framework for creating ambitious web applications that eliminates boilerplate and provides a standard application architecture.
Knockout.js – Simplifies dynamic JavaScript UIs by applying the Model-View-View Model (MVVM) pattern
Look at to download a sample TODO list app across all possible MV* platforms.
A decent comparison and factors to consider for choosing a framework can be found here.
As you may have noticed by now, JavaScript is the main ingredient of Hybrid app development. And you need specialized development skills to make the app look and behave like a native one.
With great power comes great responsibility. We can’t develop just any HTML app and expect it to work smoothly on mobile. Generally, Hybrid apps have longer response times than their Native counterparts, as their calls have to pass through an intermediate layer (PhoneGap). The articles below give some tips and tricks to build better-performing apps.
*People generally refer to Weinre for debugging PhoneGap applications. Complete list of debugging tools can be found here.
Security & Offline Support:
PhoneGap has a very good write-up about security for each platform. Reverse engineering is a concern for many people, since one can simply open the binary and look at the JavaScript code. PhoneGap recommends downloading JavaScript at runtime. Some people felt that App Store approval may become difficult if we follow this recommendation, as we would be loading JS at runtime.
Offline file security is mostly device-driven, and it has little to do with whether you are developing a hybrid or native application. Recommended defenses are:
1. Lock screen
2. Internal storage
3. Full-disk encryption
Even after applying all of the above, the mobile data is still vulnerable. Please refer to this talk from AnDevCon IV for all offenses and defenses. There is a need for a secure data store on mobile (SQLCipher?).
Tom Gersic’s advice goes like this: “If you know what device you are going with, if you are rolling out 1000 iPads in the enterprise, it’s going to be a lot quicker to develop for just one platform, a native application. The tool set is better. It’s going to be more performant, a better experience for the users. If you are rolling out a mobile website, rolling out to Android, rolling out to tablets and phones, rolling out to iPads and iPhones, if you are thinking in the back of your head about Windows Phone… that’s when you start looking at… Hybrid development… one codebase and a quicker way to go to market.”
The bottom line: if you know the platform and want a smoothly performing application, use Native. If you want one code base and to reach multiple platforms, choose Hybrid.
Salesforce Mobile Services:
Salesforce has done a good job of helping developers build enterprise mobile applications on the platform of their choice.
1. The authentication model is already set up. Salesforce uses OAuth 2.0 to authenticate
2. SmartStore - an encrypted data store available for both hybrid and native platforms
3. Developer mobile packs to quickly start app development
4. Salesforce Mobile SDK – REST APIs and JavaScript Remoting give access to functionality and data
5. VF components are available for quick HTML5 application development
6. Mobile push notifications are on the roadmap



Sample: Posting a Comment with a File Attachment

July 09, 2013

The SFDC Apex Code Developer's Guide gives a sample for posting a comment with a file attachment using Chatter in Apex (ConnectApi). But the code cannot be copied and pasted as is. You may see the following error at messageInput.messageSegments.add(textSegment);

System.NullPointerException: Attempt to de-reference a null object

The correct code looks as below. 

String communityId = null;
String feedItemId = '0D5x0000000azYd';
ConnectApi.CommentInput input = new ConnectApi.CommentInput();

ConnectApi.MessageBodyInput messageInput = new ConnectApi.MessageBodyInput();
// Initialize the list first; adding to an uninitialized list causes the NullPointerException
messageInput.messageSegments = new List<ConnectApi.MessageSegmentInput>();

ConnectApi.TextSegmentInput textSegment;
textSegment = new ConnectApi.TextSegmentInput();
textSegment.text = 'Comment Text Body';
messageInput.messageSegments.add(textSegment);
input.body = messageInput;

ConnectApi.NewFileAttachmentInput attachmentInput = new ConnectApi.NewFileAttachmentInput();
attachmentInput.description = 'The description of the file';
attachmentInput.title = 'contentFile.txt';
input.attachment = attachmentInput;

String fileContents = 'This is the content of the file.';
Blob fileBlob = Blob.valueOf(fileContents);
// The third argument (file name) is cut off in the original listing; 'contentFile.txt' is assumed here
ConnectApi.BinaryInput binaryInput = new ConnectApi.BinaryInput(fileBlob, 'text/plain', 'contentFile.txt');

ConnectApi.Comment commentRep = ConnectApi.ChatterFeeds.postComment(communityId, feedItemId, input, binaryInput);


Why does my custom prerequisite say it is corrupt?

July 05, 2013

Verifying file hash
Error: Setup has detected that the file '.....XXXXX.msi' has either changed since it was initially published or may be corrupt.
Option 1: Uninstall .NET 4.5 and work with .NET 4.0 only
Option 2: Sign your package files with a code-signing certificate to avoid a defect introduced in .NET 4.5



Updating SourceData/ Data Source of the Pivot Table

July 05, 2013


1. Run time error '1004' Cannot Open PivotTable source file

2. Data source reference is not valid

3. Can not use web data source as pivot data

4. Run-time error '-2147024809 (80070057)' Item with specified name wasn't found

5. Error: Cannot open PivotTable source file ‘[filename[x].xls]SourceData’

6. Exception from HRESULT: 0x800A03EC   at System.Dynamic.ComRuntimeHelpers.CheckThrowException(Int32 hresult, ExcepInfo& excepInfo, UInt32 argErr, String message)
   at CallSite.Target(Closure , CallSite , ComObject , String )
   at System.Dynamic.UpdateDelegates.UpdateAndExecute2[T0,T1,TRet](CallSite site, T0 arg0, T1 arg1)
   at CallSite.Target(Closure , CallSite , Object , String )
   at System.Dynamic.UpdateDelegates.UpdateAndExecute2[T0,T1,TRet](CallSite site, T0 arg0, T1 arg1)


We spent a day searching for a solution. We get this error when we open the Excel sheet in Internet Explorer.

We see the error "Run time error '1004' Cannot Open PivotTable source file" when the following statement is run in VBA.


As we can observe, the sheet name is prefixed with the URI. This is incorrect. In order to restore SourceData to its original value, we have to remove the URI from the data sheet reference.

This is not possible unless the file is read-write. Thus, when the workbook opens, we save the file to make it writable, in the WorkbookOpen event handler.

private void Application_WorkbookOpen(Excel.Workbook xlb)
{
    // Only act on workbooks opened straight from the download servlet URL
    if (Globals.ThisAddIn.Application.ActiveWorkbook.FullName.Contains("servlet.FileDownload"))
    {
        Application.DisplayAlerts = false;
        String str = Globals.ThisAddIn.Application.ActiveWorkbook.FullName;
        int index = str.IndexOf("file=");
        String tempId = str.Substring(index);
        // Saving a local copy makes the workbook writable
        xlb.SaveAs("servlet.FileDownload" + tempId);
        Application.DisplayAlerts = true;
    }
}

By doing this we ensure that the data source of the file is editable. But it doesn't change the source of the pivot tables.

The next step is to update the SourceData. First, we tried


With Sheets("TargetSheetName").PivotTables("PivotTableName").PivotCache
    .SourceData = Sheets("SourceSheetName").Range("a16:CI51").Address(True, True, xlR1C1, True)
End With
*TargetSheetName is the sheet where the pivot table resides.

This failed with the exception HRESULT: 0x800A03EC.

We worked around the problem by using the following

.ChangePivotCache(xlb.PivotCaches().Create(Excel.XlPivotTableSourceType.xlDatabase, "PivotSheetName!A14:U51"));                            


Sub Update_PTSource()
    With ActiveSheet
        .PivotTables("PivotSheetName").ChangePivotCache ActiveWorkbook. _
            PivotCaches.Create(SourceType:=xlDatabase, _
            SourceData:="'" & .Name & "'!PTsource")
    End With
End Sub




ScraperWiki: Cloud Scraping Platform

July 03, 2013

Recently, I was trying to scrape some data from the web. Initially, I struggled to set up my scraping environment based on Groovy. Soon enough, I came across ScraperWiki, a cloud scraping platform. It supports DOM parsing and a SQLite database, and the documentation is simple and well maintained.
I quickly chose PHP to scrape the data. Then I could focus on the problem rather than worrying about setting up the platform.

Following is one of the scripts I have written to extract JSON data into a SQL database.

1. Referred to the existing table in ScraperWiki
2. Constructed the URL to fetch player statistics
3. Used json_decode to decode JSON data
4. Stored data in table using composite primary key (player_id,season)

require 'scraperwiki/simple_html_dom.php';
$playerIds = scraperwiki::select("distinct player_id from src.swdata desc");
foreach ($playerIds as $playerid) {
    // (The lines that build the player-statistics URL, fetch it into $json_content
    //  and json_decode() it into $myPlayerData are missing from this listing.)

    //print "JSON".$json_content;
    if (strpos($json_content, "The page is not found") === FALSE) {

        // PLAYER DATA
        //print "Id".$player_id;
        //OVERALL Batting Status
        print $player_fullName.$player_id."\n";
        foreach ($myPlayerData["stats"] as $stsType) {
            // (The lines that pick $overAllBatting, $overAllBowling, $overAllFielding
            //  and $overAllRecords out of $stsType are missing from this listing.)

            // Prepare Overall Record
            // "$record" is an assumed name; the assignment line is missing from this listing.
            $record = array(
                'fullname'=> $player_fullName,
                'nationality'=> $player_Nationality,
                'dateOfBirth'=> $player_DOB,
                'player_id'=> $player_id,
                'BAT_Mat'=> $overAllBatting ? $overAllBatting["m"] : "-",
                'BAT_Inns'=> $overAllBatting ? $overAllBatting["inns"] : "-",
                'BAT_NO'=> $overAllBatting ? $overAllBatting["no"] : "-",
                'BAT_Runs'=> $overAllBatting ? $overAllBatting["r"] : "-",
                'BAT_HS'=> $overAllBatting ? $overAllBatting["hs"] : "-",
                'BAT_Ave'=> $overAllBatting ? $overAllBatting["a"] : "-",
                'BAT_BF'=> $overAllBatting ? $overAllBatting["b"] : "-",
                'BAT_SR'=> $overAllBatting ? $overAllBatting["sr"] : "-",
                'BAT_100'=> $overAllBatting ? $overAllBatting["100s"] : "-",
                'BAT_50'=> $overAllBatting ? $overAllBatting["50s"] : "-",
                'BAT_4s'=> $overAllBatting ? $overAllBatting["4s"] : "-",
                'BAT_6s'=> $overAllBatting ? $overAllBatting["6s"] : "-",
                'BAT_Ct'=> $overAllFielding ? $overAllFielding["c"] : "-",
                'BAT_St'=> $overAllFielding ? $overAllFielding["s"] : "-",
                'BOWL_Mat'=> $overAllBowling ? $overAllBowling["m"] : "-",
                'BOWL_Inns'=> $overAllBowling ? $overAllBowling["inns"] : "-",
                'BOWL_Balls'=> $overAllBowling ? $overAllBowling["b"] : "-",
                'BOWL_Runs'=> $overAllBowling ? $overAllBowling["r"] : "-",
                'BOWL_Dots'=> $overAllBowling ? $overAllBowling["d"] : "-",
                'BOWL_Wkts'=> $overAllBowling ? $overAllBowling["w"] : "-",
                // Best bowling as "wickets-runs"; "." is used for concatenation ("+" would add numbers)
                'BOWL_BBM'=> $overAllBowling && $overAllBowling["bbmr"] !== "-" && $overAllBowling["bbmw"] !== "-" ? $overAllBowling["bbmw"]."-".$overAllBowling["bbmr"] : "-",
                'BOWL_Ave'=> $overAllBowling ? $overAllBowling["a"] : "-",
                'BOWL_Econ'=> $overAllBowling ? $overAllBowling["e"] : "-",
                'BOWL_SR'=> $overAllBowling ? $overAllBowling["sr"] : "-",
                'BOWL_4w'=> $overAllBowling ? $overAllBowling["4w"] : "-",
                'BOWL_5w'=> $overAllBowling ? $overAllBowling["5w"] : "-");
            // (The call that stores the overall row is missing from this listing.)

            // Prepare Seasonal Data
            foreach ($overAllRecords["breakdown"] as $seasonData) {
                //print $seasonData["seasonId"]."\n";
                // Prepare Overall Record
                $record = array(
                    'fullname'=> $player_fullName,
                    'nationality'=> $player_Nationality,
                    'dateOfBirth'=> $player_DOB,
                    'player_id'=> $player_id,
                    'BAT_Mat'=> $overAllBatting ? $overAllBatting["m"] : "-",
                    'BAT_Inns'=> $overAllBatting ? $overAllBatting["inns"] : "-",
                    'BAT_NO'=> $overAllBatting ? $overAllBatting["no"] : "-",
                    'BAT_Runs'=> $overAllBatting ? $overAllBatting["r"] : "-",
                    'BAT_HS'=> $overAllBatting ? $overAllBatting["hs"] : "-",
                    'BAT_Ave'=> $overAllBatting ? $overAllBatting["a"] : "-",
                    'BAT_BF'=> $overAllBatting ? $overAllBatting["b"] : "-",
                    'BAT_SR'=> $overAllBatting ? $overAllBatting["sr"] : "-",
                    'BAT_100'=> $overAllBatting ? $overAllBatting["100s"] : "-",
                    'BAT_50'=> $overAllBatting ? $overAllBatting["50s"] : "-",
                    'BAT_4s'=> $overAllBatting ? $overAllBatting["4s"] : "-",
                    'BAT_6s'=> $overAllBatting ? $overAllBatting["6s"] : "-",
                    'BAT_Ct'=> $overAllFielding ? $overAllFielding["c"] : "-",
                    'BAT_St'=> $overAllFielding ? $overAllFielding["s"] : "-",
                    'BOWL_Mat'=> $overAllBowling ? $overAllBowling["m"] : "-",
                    'BOWL_Inns'=> $overAllBowling ? $overAllBowling["inns"] : "-",
                    'BOWL_Balls'=> $overAllBowling ? $overAllBowling["b"] : "-",
                    'BOWL_Runs'=> $overAllBowling ? $overAllBowling["r"] : "-",
                    'BOWL_Dots'=> $overAllBowling ? $overAllBowling["d"] : "-",
                    'BOWL_Wkts'=> $overAllBowling ? $overAllBowling["w"] : "-",
                    'BOWL_BBM'=> $overAllBowling && $overAllBowling["bbmr"] !== "-" && $overAllBowling["bbmw"] !== "-" ? $overAllBowling["bbmw"]."-".$overAllBowling["bbmr"] : "-",
                    'BOWL_Ave'=> $overAllBowling ? $overAllBowling["a"] : "-",
                    'BOWL_Econ'=> $overAllBowling ? $overAllBowling["e"] : "-",
                    'BOWL_SR'=> $overAllBowling ? $overAllBowling["sr"] : "-",
                    'BOWL_4w'=> $overAllBowling ? $overAllBowling["4w"] : "-",
                    'BOWL_5w'=> $overAllBowling ? $overAllBowling["5w"] : "-");
                // (The call that stores the row under the composite key
                //  (player_id, season) is missing from this listing.)
            }
        }
        // The next line sat in an else branch whose condition is missing from this listing.
        // print "Empty STATS";
    } else {
        print "Empty JSON FOR".$playerid["player_id"]."\n";
    }
}
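
Two ideas carry most of the script above: every statistic falls back to "-" when its source block is absent, and a row is identified by the composite key (player_id, season), so saving the same player and season again overwrites the earlier row. The sketch below shows both in isolation in JavaScript, with a Map standing in for the SQLite table. The function names statOrDash, buildRow and saveRow and the sample numbers are invented for the illustration. Unlike the PHP ternaries, statOrDash also checks that the key itself exists.

```javascript
// Return stats[key] when the stats object exists and has the key, otherwise "-".
function statOrDash(stats, key) {
  return stats && stats[key] !== undefined ? stats[key] : "-";
}

// Flatten one player's season into a single row (only a few columns shown).
function buildRow(playerId, season, batting, bowling) {
  return {
    player_id: playerId,
    season: season,
    BAT_Mat: statOrDash(batting, "m"),
    BAT_Runs: statOrDash(batting, "r"),
    BOWL_Wkts: statOrDash(bowling, "w"),
  };
}

// In-memory stand-in for the table: the composite key (player_id, season)
// identifies a row, so a second save for the same pair replaces the first.
function saveRow(table, row) {
  table.set(row.player_id + "|" + row.season, row);
}

const table = new Map();
saveRow(table, buildRow(7, "2013", { m: 16, r: 450 }, null));
saveRow(table, buildRow(7, "2013", { m: 16, r: 461 }, { w: 3 }));
console.log(table.size, table.get("7|2013"));
```

Keying on both fields is what lets the scraper be re-run without piling up duplicate rows for a season that was already stored.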

