Monday, August 14, 2017

Scrape Amazon Product Data with ASIN/UPC

If you sell online, wouldn’t you want to know how your prices compare with those on popular ecommerce sites such as Amazon or eBay? If your products are not selling, the cause could be the price, the product description, the product pictures, or simply that the products are not good enough. This is exactly why people are turning to Amazon for product mining.
In this article I will show you how to retrieve the product data you need from Amazon using the web scraping tool Octoparse. Let’s get straight to the point.

Step 1: Get prepared
Gather the list of ASINs or UPCs for the products you want to search for on Amazon.
An ASIN (Amazon Standard Identification Number) is a unique block of 10 letters and/or numbers that identifies an item on Amazon. Each item listed on Amazon has a unique ASIN. If you already know the ASINs for the products you need to search for, great! If not, try UPCs.
A UPC (Universal Product Code) is a 12-digit bar code used extensively for retail packaging in the United States. Find the UPCs for your products and make a list of them.
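Before pasting the list into Octoparse, it can help to weed out mistyped codes. Here is a minimal Python sketch of how that might be done. It is not part of Octoparse. The function names and the sample codes are my own, made up for illustration. The UPC check uses the standard UPC-A check digit rule, and the ASIN check only tests the "10 letters and/or numbers" shape described above, so it cannot tell whether an ASIN really exists on Amazon.

```python
import re

# Shape checks only: 10 letters/digits for an ASIN, 12 digits for a UPC-A.
ASIN_SHAPE = re.compile(r"[A-Z0-9]{10}")
UPC_SHAPE = re.compile(r"[0-9]{12}")


def is_asin_shaped(code):
    """True if the code has the 10-character alphanumeric shape of an ASIN."""
    return ASIN_SHAPE.fullmatch(code.strip().upper()) is not None


def is_valid_upc(code):
    """True if the code is 12 digits and its last digit is the UPC-A check digit."""
    code = code.strip()
    if UPC_SHAPE.fullmatch(code) is None:
        return False
    digits = [int(ch) for ch in code]
    # Digits in odd positions (1st, 3rd, ..., 11th) are weighted by 3.
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2])
    return (10 - total % 10) % 10 == digits[11]


def keep_valid_codes(raw_codes):
    """Return codes that pass either check, in input order, without duplicates."""
    seen = set()
    kept = []
    for raw in raw_codes:
        code = raw.strip().upper()
        if code in seen:
            continue
        if is_valid_upc(code) or is_asin_shaped(code):
            seen.add(code)
            kept.append(code)
    return kept


if __name__ == "__main__":
    # Hypothetical input: one good UPC, a duplicate, a UPC with a bad
    # check digit, and a made-up ASIN-shaped string.
    codes = ["036000291452", " 036000291452 ", "036000291453", "B00EXAMPLE"]
    # One code per line, ready to paste into a text list.
    print("\n".join(keep_valid_codes(codes)))
```

A code that fails both checks is most likely a typo, so it is worth fixing it at this stage rather than discovering an empty search result later.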

Step 2: Capture Data
Launch the application (download it at www.octoparse.com) and start a new task.
Click "Quick Start" ➜ Click "New Task (Advanced Mode)". I prefer Advanced Mode because it offers a lot more flexibility than Wizard Mode, and I have never found it too "advanced" to learn ➜ Fill in the basic information such as Task Name and Category. When everything is done, click "Next" to proceed to the next step.

Drag a "Go to webpage" action into the workflow pane ➜ Enter the page URL: www.amazon.com ➜ Click "Save" ➜ Amazon's webpage loads in the built-in browser.

Click into the search box on the webpage ➜ when Octoparse prompts you for the next action, pick "Enter text value".

To search repeatedly with a list of UPC codes, we'll need to use the loop action. Drag a "Loop" action into the workflow pane ➜ Select "Text List" ➜ Copy and paste the list of UPC codes into the text box ➜ Click "OK" ➜ Save ➜ Notice how the list of UPCs gets added to "Loop Items".
 

Now, to make sense of the steps in the workflow, drag the "Input Text" action to the inside of the "Loop" action ➜ Under Advanced Options for the "Text Input" action, check "Use text in Loop Item to fill in the text box" ➜ Save ➜ Octoparse is now configured to use the text values added to the loop to search on Amazon's website.
 

Since the workflow has been rearranged, click through it from the first action to make sure the defined actions work as intended.
In my case, everything is working properly, so when I get to the "Enter Text" step, the first UPC code from the list is filled into the text box automatically.

Next, click on the search button ➜ When prompted, select "Click an item" ➜ A "Click" action gets added to the loop automatically.

Once we are on the product detail page, capture whatever product information is needed, just as in any other extraction task. Here, I need the product title, rating, number of reviews, product category and price.
Click on the title of the product ➜ Select "Extract Text" when prompted ➜ Notice how the product title gets added to the data pane next to the workflow designer ➜ Capture all other data fields similarly.

Edit the field names so that each one matches the data field it holds.


So, is everything good to go now? Not so fast! There's still a bit of double-checking to do: click through the workflow from the first action to the last to make sure everything works properly. Soon, we find that the "Click" action for the Search button doesn't really work.

Most of the time, when something fails to work as intended, it is because the XPath auto-generated by Octoparse fails to locate the item correctly. To solve this, we'll need to modify the XPath manually. There are plenty of XPath tutorials in Octoparse's tutorial section (http://www.octoparse.com/features-tutorial?category=XPATH), so we'll skip the detailed steps and go straight to modifying the XPath.
Select the action "Click an item", making sure it's the one for the Search button ➜ Click "Customized" under Advanced Option ➜ Click on "Define ways to locate an item" ➜ Copy and paste the correct XPath ( .//*[@id='nav-search']/form/div[2]/div/input ) ➜ Click "OK".
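If you would like to see how an XPath like this picks out an element, here is a small Python sketch using only the standard library. The markup is a hypothetical, heavily simplified stand-in that I wrote for illustration. It is not Amazon's real page, which is far larger and changes over time. Note also that Python's xml.etree.ElementTree supports only a subset of XPath. That subset is enough for this expression, but it does not cover functions such as contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the search bar.
# The real page structure is an assumption here and will differ.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="nav-search">
      <form>
        <div><select name="category"/></div>
        <div>
          <div><input type="submit" value="Go"/></div>
        </div>
        <div><input type="text" name="keywords"/></div>
      </form>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE_PAGE)

# Same XPath as in the step above: inside the element with
# id="nav-search", take the form's second <div>, then the <input>
# nested one <div> deeper.
SEARCH_BUTTON_XPATH = ".//*[@id='nav-search']/form/div[2]/div/input"
matches = root.findall(SEARCH_BUTTON_XPATH)

for element in matches:
    print(element.tag, element.attrib)
```

The point of the exercise is that the path is positional: div[2] means "the second div under the form", so if the page layout shifts, the same XPath can silently point at a different element or at nothing at all.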

Now re-run the workflow, this time selecting a UPC code other than the first one, to test whether the extraction steps locate the individual data fields correctly.
  
When we get to the "Extract data" step, we'll notice that the price is missing from the extracted data.
 

So again, we’ll need to modify the XPath to the correct one (.//*[contains(text(), 'Price:')]/following-sibling::td/span[1]).

Finally, we are ready to run the extraction task. We can run it on the local machine, using local bandwidth, IP, memory and so on. Alternatively, we can choose "Cloud Extraction" to run it in the cloud (available to paid users only). If you are a heavy user, I strongly recommend Cloud Extraction: it is a relief to leave everything in the cloud and come back for the complete data set without worrying about network interruptions or the computer freezing.
Here's what I got. The extracted data looks pretty neat.

So, that's everything I have for this tutorial. Here's the otd file I used for the demonstration. I hope you have enjoyed reading it. The Octoparse team is very supportive most of the time, so if you need additional help, definitely reach out to them.
Thank you and stay tuned for more Amazon scraping tips! Cheers!
