Excel Scraping? This may be a new term. But in term of programming, Scraping is a data harvesting technique to programmatically pull data out of a formatted sources.
For instance, web scrapping refers to the way how programmer writes codes to extract data from a nicely, and human-eye-friendly html page(s) into a data source. Generally, it’s a bad idea because the way how we extract data would largely depend on how the page is formatted, which is often without semantics. In other words, if the web designer intentionally or unintentionally change a small bit on the layout, your codes mostly fail immediately. As a result, scraping is not reliable and its usage is not suggested. It should ONLY be used when there is no other choice available and you are forced to get those data.
When working in a stationery distributor, or wholesaler, it’s unavoidable to deal with a lot of data – sales data, buying data, inventory data, to name a few. In my environment, we are talking about managing more than 30k unique inventory items that could be sold and we need all description, unit of measure (UOM), minimum order quantity (MOQ) and cost price, sell price updated in a timely and accurate manner whenever our upstream suppliers update their pricing. Each supplier generally carries hundreds to thousands of products, it is nearly impossible for any person to manually update the records one by one. All we have to do, (or actually, the only thing we could do), is to do a bulk import/update.
Yes, If the supplier can provide us a CSV file or Excel spreadsheet, it’s relatively easier to import into a database. Most database systems can deal with it easily. However, let’s consider the following Excel file:
Price has been covered to protect the interests of company and supplies.
What should the actual product description to be? Should it be:
- The description in the same line of product code?
- A combination of the Black and Bold sub-heading + the description?
- A combination of the Red heading + Black and Bold sub-heading + the description?
Yes, from this screenshot, most of the data in the description repeats the heading. For simplicity, we possibly choose #1. HOWEVER, let’s look at other part of the SAME file.
Now, for all those nice Tapes, the description on the same line is NOT meaningful at all, and it requires the heading to form a meaningful description. should SD18482 be named as:
- RHINO ELECTRONIC INDUSTRIAL TOOLS & TAPES PERMANENT POLYESTER – LENGTH 5.5M TAPE, WHITE PERM POLY 9MM; or
- PERMANENT POLYESTER – LENGTH 5.5M TAPE, WHITE PERM POLY 9MM; or
- something else?
The real problem is, there is even no universal rule applicable in the SAME file. How could we effectively import this price list into our system? Product code can be changed in the later file, so does the EAN code. All the processing has to rely on the line description to determine when (not) to include the heading. What it means is that it’s technically impossible to write an import program to process the file. At most, what I can do is to write some source codes to assist me processing the file, and the program need to be reviewed and recompiled every time a new file comes in even though they may be from the same supplier. Remember, this file is layout-centric, which means supplier can change whenever and whatever they like, including the text of the heading.
Let’s continue how I attempt to address this problem next time.
