Extracting Google-Indexed Web Site Pages Using MS Excel

In the field of Internet marketing research or search engine optimization, it is very important to know which pages of your web site have been indexed in Google. Currently, by default there is no way to export indexed pages in Google to a convenient spreadsheet for further analysis. If you want to learn how to do this fairly easily, however, keep reading.

Popular spreadsheet software such as MS Excel is readily available to accomplish this task. However, the problem is a bit challenging: “How can you extract indexed pages of our website to MS Excel?”

First, before we can proceed with this tutorial, we will illustrate the importance of extracting indexed pages:

  • Duplicate content analysis in search engine optimization (for example, finding similar pages in your website) is only possible if you can compare one page to another by having a list of URLs in your website. 

  • Spotting the important and low value URLs in your website are key factors in optimizing traffic. Without a list of these indexed URLs, it will be very hard to improve. 

  • Lastly, if you are concerned with your website security, by extracting all pages of your website which have been indexed in Google, you can easily know which URLs affect your web site’s security. Examples of such URLs are admin pages, download pages and customer information pages. By nature, Google indexes URLs if your server does not restrict them and if there are links pointing to them.

Currently, Google is the most powerful search engine on the planet, yet it does not offer tools to let any webmaster extract or easily download indexed pages. Compare this to Yahoo, with its Yahoo Site Explorer that offers tools for downloading indexed pages. However, note that this data is only good in Yahoo, and is NOT applicable for Google.

These tutorials apply only to MS Excel 2002- 2007; note that Google can only give up to 1000 URLs.

{mospagebreak title=Understanding the Google Search Result}

When you visit http://www.google.com to find the indexed pages of your website, it is not as easy as typing in your domain name and let Google provide you a complete list of indexed pages. You have to use proper syntax in order for Google to give the complete list of URLs.

The indexed pages of your domain can be determined by entering this syntax in the Google search box and then pressing “Google Search” site:domainname.com

For example, if your website is http://www.thisisyourwebsite.com , you have to enter it as: site:thisisyourwebsite.com (not including the www).

We do not include the www in the search query so we can capture all URLs, including the other sub-domains in your website (not only those that start with “www”).

Below is the screen shot of the Google search result (circled in red are the indexed URLs which you will need to extract; the other results are irrelevant to your site):

Also note that the Google search result by default shows only the first ten results. Extracting the indexed URLs of your site from the Google search result to your Excel spreadsheet involves the copy and paste method. So we need to change this to the maximum (which is 100), which will speed up the process. Follow this simple procedure:

Step 1: Enter site:thisisyourwebsite.com in the Google search box.

Step 2: Beside the Google search box, click “preferences.”

Step 3: In the section for number of results, change it from 10 to 100.

Step 4: Click “Save Preferences.”

{mospagebreak title=The Process}

This method does not involve Visual Basic programming in Excel, but plain manipulation using the built-in text functions.

Step 1. After entering the search query: site:thisisyourwebsite.com in the Google search box, select the appropriate portion of data in the Google search result.

Only select the indexed pages and nothing else; see the sample screen shot below (highlighted regions are the selected areas to be copied and pasted into the Excel spreadsheet).

Do not include Ads and Sponsored results in the text selection.

Step 2. Right click and copy, open MS Excel, and then on cell A1: Right click again, Paste special, and paste as “text.”

After pasting the data, it should look like this:

After pasting the data as text, we will now filter the column for relevant information:

Step 3: Click on cell A1, then on the Excel buttons, click “Data,” Filter, and finally Auto filter.

Step 4: On the auto filter, click the drop down arrow button and then click “Custom.” The Custom Auto Filter Dialog Box will then appear.

Step 5: On the first drop down menu under “Shows rows where #:” select “Contains,” and then on the second drop down menu type www.aspfree.com. This means you will need to filter only information containing www.aspfree.com — only the indexed URLs of the domain.

Step 6: On the Custom Auto filter Dialog Box, find the checkbox containing “AND” and also “OR.” Check the “AND” checkbox.

Step 7. Below the AND/OR checkboxes, there are two additional drop down menus. On the left (first) drop down, select “contains” and then on the right (second) drop down, type “cached.” Finally, when everything is set, click OK.

In case you found the above steps confusing, make sure the Custom Auto filter Dialog Box looks like this after following Steps 1 to 7:

{mospagebreak title=Explaining the Results}

The above steps should filter rows and leave only rows containing the following domain name URL: www.aspfree.com and cached link.

After filtering the rows, select the result of the filtered rows and copy and paste that into another Excel worksheet.

The result of the filtered rows should have information somewhat like this:

www.aspfree.com/ – 75k – Cached – Similar pages –

www.aspfree.com/c/b/XML/ – 47k – Cached – Similar pages –

www.aspfree.com/c/b/IIS/ – 47k – Cached – Similar pages –

www.aspfree.com/asp/freeasphost.asp – 77k – Cached – Similar pages –

www.aspfree.com/c/b/Silverlight/ – 46k – Cached – Similar pages –

www.aspfree.com/c/b/BrainDump/ – 47k – Cached – Similar pages –

www.aspfree.com/c/b/ASP/ – 47k – Cached – Similar pages –

At this point, the results are not yet the indexed URLs, as they contain unrelated data, such as the file size of the web page (75k, for example) and other stuff.

What we will do is extract only the URLs in the rows, so for example in this data:

www.aspfree.com/c/b/BrainDump/ – 47k – Cached – Similar pages –

We will extract only the URL so that it will now be:

www.aspfree.com/c/b/BrainDump/

To do that in MS Excel, you will have to use the function:

=CONCATENATE("http://",MID(A1,1,(FIND(" ",A1,1))-1))

So after filtering the information and copying that to another sheet as instructed in the previous steps, copy and paste that formula into cell B1, with your filtered data in cell A1.

Column B in the Excel worksheet will now give the indexed URLs. If the data to be filtered is placed starting in cell A1, make sure that the formula is placed in cell B1, and it will be:

=CONCATENATE("http://",MID(A1,1,(FIND(" ",A1,1))-1))

Click and drag the formula until all data are correctly filtered for clean URLs.

The only disadvantage with this method is that the results will be inaccurate for long URLs. So after filtering and applying text manipulation using Excel, it will be useless because Google displays dots for long URLs at the end. For example:

www.somewebsite.com/this…/should-be-a-very-very-long-url-which-google-will-display-properly-and-will-makefiltering-hard…

In this case, you have to open it manually using the browser to extract URLs for what is shown exactly in the address bar. Having long URLs is not recommended, as it tends to look unfriendly and spammy for search engines. In this case, consider shortening the URLs in your website.

4 thoughts on “Extracting Google-Indexed Web Site Pages Using MS Excel

  1. Hi,
    The post is helpful. Though I think we can further enhance the last step where we are only left with url and additional text to be extracted.
    We can delimit the text and convert text to column with delimiter as just a single space. Since url are always continous. And delete the additional.

[gp-comments width="770" linklove="off" ]