Requests vs Pycurl
When looking for python online data collection libraries (a.k.a. Scraping libraries) the main result is requests , however there is a much better library in my opinion called PyCurl, it is the python interface for libcurl which is basically the native library for curl . In this project I get immediately blocked when using requests, moreover the interesting part of the project is that we won't scrape an html code that we then have to parse with beautifulsoup, we will instead scrape data directly in a Json format. This gives us many advantages because Json is basically a dictionary that we can easily navigate, without need to learn a new library. Please have a look to this stackoverflow page if you are more interested in PyCurl vs Requests performances.
Introduction
As Data Scientist we are probably used to query our data using SQL, I have noticed a very common problem in Junior but also Senior analytics professional, they think that a table in a database is at the very beginning of the process, this is because it is their process within a bigger one; sometimes it is just a matter of point of view, for a Data Engineer a table in a database is at the end of the process!
If we want to have a broader view we need to look out of the Analytics bubble, this is the reason why i started this project.
When scraping data, the most important part is the website structure analysis, in fact I have spent many hours navigating the website to get as much info as possible. I would like to guide you through some of the encountered problems.
Javascript generated content
For complex websites you cannot directly scrape related HTML because most of the time there will be a JavaScript script generating informations. I have started considering Selenium to simulate a webdriver which would have fully loaded HTML info and then parse all this info with beautiful soup.
Selenium
The first problem of vivino website using Selenium was to fully load the wine list scrolling to the bottom, this require a scroll call with selenium until the end is reached, however by using this strategy you have to perform multiple calls and there may be the risk of being banned.
Find the most important JSON
When navigating the website I was not able to find a Json file with most of the info, this is because you get such file only after you start scrolling to the end of the first page. We now have a huge dictionary containing all details about a wine, even if some information in such dictionary are not displayed in the page you are looking.
Tables that will be created from this code
- vintage (this is the main table)
- region_most_used_grapes
- flavor
- flavor_primary_keywords
- flavor_secondary_keywords
- prices
- food
- grapes
Start of The Project
Import Libraries
import time
import pycurl
import certifi
from io import BytesIO
import json
import pandas as pd
Performing cURL Request
%%time
n = 0
pages_available = True
wines_collection = []
for i in range(100):
# Creating a buffer as the cURL is not allocating a buffer for the network response
buffer = BytesIO()
# Initialising the curl object
curl = pycurl.Curl()
# Setting the request URL, the user agent, and the buffer for the response
url = f'https://www.vivino.com/api/explore/explore?country_code=IT¤cy_code=EUR&grape_filter=varietal&min_rating=1&order_by=price&order=desc&price_range_max=500+&price_range_min=0&wine_type_ids%5B%5D=1&wine_type_ids%5B%5D=2&wine_type_ids%5B%5D=3&wine_type_ids%5B%5D=24&wine_type_ids%5B%5D=7&wine_type_ids%5B%5D=4&page={n}&language=en'
curl.setopt(pycurl.URL, url)
curl.setopt(pycurl.USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0")
curl.setopt(pycurl.WRITEDATA, buffer)
curl.setopt(pycurl.CAINFO, certifi.where())
# Performing the actual request
curl.perform()
# Accessing the response code and adding 1 to n only if successful, this step is not necessary it allows you to check each step
if curl.getinfo(pycurl.RESPONSE_CODE) == 200:
#print(f'Success : {n} {curl.getinfo(pycurl.RESPONSE_CODE)}', )
n +=1
else:
print(f'Not Success{curl.getinfo(pycurl.RESPONSE_CODE)}')
# Ending the session and freeing the resources
curl.close()
#Convert Buffer from Byte to string to Dict
page_dict = json.loads(buffer.getvalue().decode())
#page_dict is now a dictionaire containing three main dictionaries, the one we want is "explore_vintage", it contains another dict "matches"
#matches is a list of dict
for i in page_dict['explore_vintage']['matches']:
wines_collection.append(i)
if n%10 == 0:
print(len(wines_collection))
Output:
CPU times: user 4.74 s, sys: 541 ms, total: 5.28 s
Wall time: 1min 23s
Check Buffer and Navigate the Tree
This step is purely to show how to transformin buffer into a dictionary from Bytes -> String -> Dict and print each result
#buffer getvalue() now store informations we need in a bytes format
print(type(buffer.getvalue()))
#hence we will decode bytes into string
string = buffer.getvalue().decode()
print(type(string))
#at this stage we can convert our string into a dictionary using json library
wines_info = json.loads(string)
print(type(wines_info))
Output:
<class 'bytes'>
<class 'str'>
<class 'dict'>
We can now have a look at our newly created dictionary structure
#We have three main keys
print(wines_info.keys())
#explore vintage has more keys
print(wines_info['explore_vintage'].keys())
#matches provides a list of values, let's see keys of the first value of this list
print(wines_info['explore_vintage']['matches'][0].keys())
print(wines_info['explore_vintage']['matches'][0]['vintage'].keys())
Output:
dict_keys(['explore_vintage', 'selected_filters', 'e'])
dict_keys(['market', 'records_matched', 'matches', 'bottle_type_errors', 'records'])
dict_keys(['vintage', 'price', 'prices'])
dict_keys(['id', 'seo_name', 'name', 'statistics', 'image', 'wine', 'year', 'grapes', 'has_valid_ratings'])
Build Pandas Dataframe
We will build a dataframe using one main pandas function, json_normalize, the first json flttening will be the easiest one, some other will require to add some parameter to json_normalize function, while at the very end we will use pure python code to manipulate our json.
Once you flatten Json and create a Dataframe with json_normalize, you can notice that there are still some columns containing a dictionary, this is because the json we just downloaded has many layers with which we can build multiple tables, each table can the be connected with primary keys like in a relational database.
vintage = pd.json_normalize(wines_collection)
vintage.head(2)
prices |
vintage.id |
vintage.seo_name |
vintage.name |
vintage.statistics.status |
vintage.statistics.ratings_count |
vintage.statistics.ratings_average |
vintage.statistics.labels_count |
vintage.statistics.wine_ratings_count |
vintage.statistics.wine_ratings_average |
vintage.statistics.wine_status |
vintage.image.location |
vintage.image.variations.bottle_large |
vintage.image.variations.bottle_medium |
vintage.image.variations.bottle_medium_square |
vintage.image.variations.bottle_small |
vintage.image.variations.bottle_small_square |
vintage.image.variations.label |
vintage.image.variations.label_large |
vintage.image.variations.label_medium |
vintage.image.variations.label_medium_square |
vintage.image.variations.label_small_square |
vintage.image.variations.large |
vintage.image.variations.medium |
vintage.image.variations.medium_square |
vintage.image.variations.small_square |
vintage.wine.id |
vintage.wine.name |
vintage.wine.seo_name |
vintage.wine.type_id |
vintage.wine.vintage_type |
vintage.wine.is_natural |
vintage.wine.region.id |
vintage.wine.region.name |
vintage.wine.region.name_en |
vintage.wine.region.seo_name |
vintage.wine.region.country.code |
vintage.wine.region.country.name |
vintage.wine.region.country.native_name |
vintage.wine.region.country.seo_name |
vintage.wine.region.country.currency.code |
vintage.wine.region.country.currency.name |
vintage.wine.region.country.currency.prefix |
vintage.wine.region.country.currency.suffix |
vintage.wine.region.country.regions_count |
vintage.wine.region.country.users_count |
vintage.wine.region.country.wines_count |
vintage.wine.region.country.wineries_count |
vintage.wine.region.country.most_used_grapes |
vintage.wine.region.background_image |
vintage.wine.winery.name |
vintage.wine.winery.seo_name |
vintage.wine.winery.status |
vintage.wine.winery.background_image |
vintage.wine.taste.structure.acidity |
vintage.wine.taste.structure.fizziness |
vintage.wine.taste.structure.intensity |
vintage.wine.taste.structure.sweetness |
vintage.wine.taste.structure.tannin |
vintage.wine.taste.structure.user_structure_count |
vintage.wine.taste.structure.calculated_structure_count |
vintage.wine.taste.flavor |
vintage.wine.statistics |
vintage.wine.style.id |
vintage.wine.style.seo_name |
vintage.wine.style.regional_name |
vintage.wine.style.varietal_name |
vintage.wine.style.name |
vintage.wine.style.image |
vintage.wine.style.background_image |
vintage.wine.style.interesting_facts |
vintage.wine.style.body |
vintage.wine.style.body_description |
vintage.wine.style.acidity |
vintage.wine.style.acidity_description |
vintage.wine.style.country.code |
vintage.wine.style.country.name |
vintage.wine.style.country.native_name |
vintage.wine.style.country.seo_name |
vintage.wine.style.country.currency.code |
vintage.wine.style.country.currency.name |
vintage.wine.style.country.currency.prefix |
vintage.wine.style.country.currency.suffix |
vintage.wine.style.country.regions_count |
vintage.wine.style.country.users_count |
vintage.wine.style.country.wines_count |
vintage.wine.style.country.wineries_count |
vintage.wine.style.country.most_used_grapes |
vintage.wine.style.wine_type_id |
vintage.wine.style.food |
vintage.wine.style.grapes |
vintage.wine.style.region.id |
vintage.wine.style.region.name |
vintage.wine.style.region.name_en |
vintage.wine.style.region.seo_name |
vintage.wine.style.region.country.code |
vintage.wine.style.region.country.name |
vintage.wine.style.region.country.seo_name |
vintage.wine.style.region.country.currency.code |
vintage.wine.style.region.country.currency.name |
vintage.wine.style.region.country.currency.prefix |
vintage.wine.style.region.country.currency.suffix |
vintage.wine.style.region.country.regions_count |
vintage.wine.style.region.country.users_count |
vintage.wine.style.region.country.wines_count |
vintage.wine.style.region.country.wineries_count |
vintage.wine.style.region.country.most_used_grapes |
vintage.wine.style.region.parent_id |
vintage.wine.style.region.background_image.location |
vintage.wine.style.region.background_image.variations.large |
vintage.wine.style.region.background_image.variations.medium |
vintage.wine.style.region.statistics.wineries_count |
vintage.wine.style.region.statistics.wines_count |
vintage.wine.style.region.statistics.sub_regions_count |
vintage.wine.style.region.statistics.parent_regions_count |
vintage.wine.style.parent_style_id |
vintage.wine.style.hidden |
vintage.wine.style.statistics.wines_count |
vintage.wine.style.statistics.aggregated_wines_count |
vintage.wine.style.vintage_mask |
vintage.wine.style.baseline_structure.acidity |
vintage.wine.style.baseline_structure.fizziness |
vintage.wine.style.baseline_structure.intensity |
vintage.wine.style.baseline_structure.sweetness |
vintage.wine.style.baseline_structure.tannin |
vintage.wine.has_valid_ratings |
vintage.year |
vintage.grapes |
vintage.has_valid_ratings |
price.id |
price.merchant_id |
price.amount |
price.discounted_from |
price.discount_percent |
price.type |
price.sku |
price.url |
price.visibility |
price.bottle_type_id |
price.currency.code |
price.currency.name |
price.currency.prefix |
price.currency.suffix |
price.xms |
price.xmphs |
price.price_group_id |
price.vfm_score |
price.bottle_type.id |
price.bottle_type.name |
price.bottle_type.short_name |
price.bottle_type.short_name_plural |
price.bottle_type.volume_ml |
vintage.wine.taste.structure |
vintage.wine.style |
vintage.wine.region.background_image.location |
vintage.wine.region.background_image.variations.large |
vintage.wine.region.background_image.variations.medium |
vintage.wine.winery.background_image.location |
vintage.wine.winery.background_image.variations.large |
vintage.wine.winery.background_image.variations.medium |
vintage.wine.winery.background_image.variations.small |
vintage.top_list_rankings |
vintage.wine.style.region.background_image |
vintage.wine.style.region |
[{'id': 35134561, 'merchant_id': 18999, 'amoun... |
156099154 |
domaine-de-la-romanee-conti-romanee-conti-gran... |
Domaine de La Romanée-Conti Romanée-Conti Gran... |
Normal |
47 |
4.7 |
715 |
9766 |
4.7 |
|
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
83912 |
Romanée-Conti Grand Cru |
romanee-conti-grand-cru |
1 |
0 |
False |
2228 |
Romanée-Conti Grand Cru |
Romanée-Conti |
romanee-conti |
fr |
Frankrijk |
France |
france |
EUR |
Euros |
€ |
None |
1307 |
6550621 |
433651 |
68039 |
[{'id': 14, 'name': 'Pinot Noir', 'seo_name': ... |
NaN |
Domaine de La Romanée-Conti |
domaine-de-la-romanee-conti |
0 |
NaN |
4.101636 |
NaN |
3.041284 |
1.959191 |
2.503016 |
354.0 |
159.0 |
[{'group': 'red_fruit', 'stats': {'count': 87,... |
None |
283.0 |
burgundy-cote-de-nuits-red |
Bourgogne |
Rode wijn uit Côte de Nuits |
Bourgogne Côte de Nuits Rood |
NaN |
NaN |
None |
3.0 |
Medium-bodied |
3.0 |
Hoog |
fr |
Frankrijk |
France |
france |
EUR |
Euros |
€ |
NaN |
1307.0 |
6550621.0 |
433651.0 |
68039.0 |
[{'id': 14, 'name': 'Pinot Noir', 'seo_name': ... |
1.0 |
[{'id': 4, 'name': 'Rundvlees', 'weight': 0.5,... |
[{'id': 14, 'name': 'Pinot noir', 'seo_name': ... |
408.0 |
Côte de Nuits |
cote-de-nuits |
fr |
Frankrijk |
france |
EUR |
Euros |
€ |
NaN |
1307.0 |
6550621.0 |
433651.0 |
68039.0 |
[{'id': 14, 'name': 'Pinot Noir', 'seo_name': ... |
385.0 |
//images.vivino.com/regions/backgrounds/oTEcWw... |
//thumbs.vivino.com/region_backgrounds/oTEcWwU... |
//thumbs.vivino.com/region_backgrounds/oTEcWwU... |
766.0 |
7329.0 |
11.0 |
1.0 |
23.0 |
False |
13612.0 |
13612.0 |
1< |
4.5 |
NaN |
3.0 |
1.0 |
2.5 |
False |
2018 |
None |
True |
35134561 |
18999 |
50960.0 |
NaN |
NaN |
vc |
MTH2204156 |
https://www.soloprodottiitaliani.it/spec_colle... |
1 |
1 |
EUR |
Euros |
€ |
None |
0.722997 |
False |
[xom] |
0.0 |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
[{'id': 35695434, 'merchant_id': 35791, 'amoun... |
1335502 |
domaine-de-la-romanee-conti-romanee-conti-gran... |
Domaine de La Romanée-Conti Romanée-Conti Gran... |
Normal |
134 |
4.6 |
1094 |
9766 |
4.7 |
|
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
//images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... |
83912 |
Romanée-Conti Grand Cru |
romanee-conti-grand-cru |
1 |
0 |
False |
2228 |
Romanée-Conti Grand Cru |
Romanée-Conti |
romanee-conti |
fr |
Frankrijk |
France |
france |
EUR |
Euros |
€ |
None |
1307 |
6550621 |
433651 |
68039 |
[{'id': 14, 'name': 'Pinot Noir', 'seo_name': ... |
NaN |
11610 |
Domaine de La Romanée-Conti |
domaine-de-la-romanee-conti |
0 |
NaN |
4.101636 |
NaN |
3.041284 |
1.959191 |
2.503016 |
354.0 |
159.0 |
[{'group': 'red_fruit', 'stats': {'count': 87,... |
None |
283.0 |
burgundy-cote-de-nuits-red |
Bourgogne |
Rode wijn uit Côte de Nuits |
Bourgogne Côte de Nuits Rood |
NaN |
NaN |
None |
None |
None |
3.0 |
Medium-bodied |
3.0 |
Hoog |
fr |
Frankrijk |
France |
france |
EUR |
Euros |
€ |
NaN |
1307.0 |
6550621.0 |
433651.0 |
68039.0 |
[{'id': 14, 'name': 'Pinot Noir', 'seo_name': ... |
1.0 |
[{'id': 4, 'name': 'Rundvlees', 'weight': 0.5,... |
[{'id': 14, 'name': 'Pinot noir', 'seo_name': ... |
408.0 |
Côte de Nuits |
cote-de-nuits |
fr |
Frankrijk |
France |
france |
EUR |
Euros |
€ |
NaN |
1307.0 |
6550621.0 |
433651.0 |
68039.0 |
[{'id': 14, 'name': 'Pinot Noir', 'seo_name': ... |
385.0 |
//images.vivino.com/regions/backgrounds/oTEcWw... |
//thumbs.vivino.com/region_backgrounds/oTEcWwU... |
//thumbs.vivino.com/region_backgrounds/oTEcWwU... |
766.0 |
7329.0 |
11.0 |
1.0 |
23.0 |
False |
13612.0 |
13612.0 |
1< |
4.5 |
NaN |
3.0 |
1.0 |
2.5 |
False |
1982 |
None |
True |
35695434 |
35791 |
18500.0 |
NaN |
NaN |
vc |
8351718768976 |
https://rarest-wines.com/products/romanee-cont... |
1 |
1 |
EUR |
Euros |
€ |
None |
0.000000 |
False |
[xom] |
0.0 |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
Nested JSON
From the table above, you may have noticed that the scraped Json is not fully transformed into a table, indeed we still have some list of dictionaries nested in our created dataframe. Check the columns "Prices", each wine may have multiple prices offered by different vendor, this is the reason why such variable is more complex than the others. There are other variables like this, from such variables we can create other tables that can be interpeted as dimensional tables.
It seems that from our simple Json we are able to build a proper schema.
Before we have used json_normalize just by giving json file as imput parameter, when dealing with nested json we can pass a list to the record_ path argument, such list represents nested json keys in order until the last one (where what you need is store) is reached.
Another parameter called meta can be provided, in this case we specify the location of a single value related to that record, in this case i will use the wine id so that we will be able in the future to join back all the information related to a wine.
Transform: vintage.wine.region.country.most_used_grapes
note how most_used_grapes is the last value of the list in record_ path
region_most_used_grapes = pd.json_normalize(wines_collection,
record_path = ['vintage', 'wine', 'region', 'country', 'most_used_grapes'],
record_prefix = 'vintage_wine_region_country_MostUsedGrapes_',
meta = [['vintage', 'id']])
region_most_used_grapes.head(10)
vintage_wine_region_country_MostUsedGrapes_id |
vintage_wine_region_country_MostUsedGrapes_name |
vintage_wine_region_country_MostUsedGrapes_seo_name |
vintage_wine_region_country_MostUsedGrapes_has_detailed_info |
vintage_wine_region_country_MostUsedGrapes_wines_count |
vintage_wine_region_country_MostUsedGrapes_parent_grape_id |
vintage.id |
14 |
Pinot Noir |
pinot-noir |
True |
572334 |
NaN |
156099154 |
5 |
Chardonnay |
chardonnay |
True |
604208 |
NaN |
156099154 |
10 |
Merlot |
merlot |
True |
566719 |
NaN |
156099154 |
14 |
Pinot Noir |
pinot-noir |
True |
572334 |
NaN |
1335502 |
5 |
Chardonnay |
chardonnay |
True |
604208 |
NaN |
1335502 |
10 |
Merlot |
merlot |
True |
566719 |
NaN |
1335502 |
15 |
Riesling |
riesling |
True |
262136 |
NaN |
156856757 |
299 |
Spätburgunder |
spatburgunder |
False |
50164 |
14.0 |
156856757 |
355 |
Weissburgunder |
weissburgunder |
False |
33464 |
36.0 |
156856757 |
14 |
Pinot Noir |
pinot-noir |
True |
572334 |
NaN |
2161228 |
Transform: vintage.wine.tase.flavor
flavor contains another dictionary that we will transform later without using json_normalize but pure python
flavor = pd.json_normalize(wines_collection,
record_path = ['vintage', 'wine', 'taste', 'flavor'],
meta = [['vintage', 'id']])
flavor.head(10)
group |
primary_keywords |
secondary_keywords |
stats.count |
stats.score |
vintage.id |
red_fruit |
[{'id': 348, 'name': 'raspberry', 'count': 34}... |
[{'id': 38, 'name': 'black cherry', 'count': 1... |
87 |
11822 |
156099154 |
earth |
[{'id': 242, 'name': 'leather', 'count': 33}, ... |
[{'id': 134, 'name': 'dark chocolate', 'count'... |
73 |
8582 |
156099154 |
oak |
[{'id': 292, 'name': 'oak', 'count': 11}, {'id... |
[{'id': 242, 'name': 'leather', 'count': 33}, ... |
69 |
4570 |
156099154 |
black_fruit |
[{'id': 334, 'name': 'plum', 'count': 14}, {'i... |
[{'id': 341, 'name': 'prune', 'count': 2}] |
49 |
6034 |
156099154 |
non_oak |
[{'id': 62, 'name': 'brioche', 'count': 1}, {'... |
[{'id': 117, 'name': 'coffee', 'count': 7}, {'... |
30 |
603 |
156099154 |
spices |
[{'id': 320, 'name': 'pepper', 'count': 6}, {'... |
[{'id': 434, 'name': 'vanilla', 'count': 2}, {... |
21 |
2195 |
156099154 |
floral |
[{'id': 436, 'name': 'violet', 'count': 5}, {'... |
[{'id': 381, 'name': 'savory', 'count': 2}] |
19 |
2050 |
156099154 |
microbio |
[{'id': 238, 'name': 'lager', 'count': 1}, {'i... |
[{'id': 284, 'name': 'mushroom', 'count': 9}, ... |
12 |
700 |
156099154 |
citrus_fruit |
[{'id': 298, 'name': 'orange', 'count': 3}, {'... |
NaN |
7 |
700 |
156099154 |
dried_fruit |
[{'id': 341, 'name': 'prune', 'count': 2}, {'i... |
[{'id': 229, 'name': 'jam', 'count': 1}] |
5 |
394 |
156099154 |
Transform: prices
prices = pd.json_normalize(wines_collection,
record_path = 'prices',
record_prefix = 'prices_', meta = [['vintage', 'id']])
prices.head(10)
prices_id |
prices_merchant_id |
prices_amount |
prices_discounted_from |
prices_discount_percent |
prices_type |
prices_sku |
prices_url |
prices_visibility |
prices_bottle_type_id |
prices_xms |
prices_xmphs |
prices_price_group_id |
prices_vfm_score |
prices_currency.code |
prices_currency.name |
prices_currency.prefix |
prices_currency.suffix |
prices_bottle_type.id |
prices_bottle_type.name |
prices_bottle_type.short_name |
prices_bottle_type.short_name_plural |
prices_bottle_type.volume_ml |
vintage.id |
35134561 |
18999 |
50960.0 |
NaN |
NaN |
vc |
MTH2204156 |
https://www.soloprodottiitaliani.it/spec_colle... |
1 |
1 |
0.722997 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
156099154 |
35695434 |
35791 |
18500.0 |
NaN |
NaN |
vc |
8351718768976 |
https://rarest-wines.com/products/romanee-cont... |
1 |
1 |
0.000000 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
1335502 |
34993404 |
18999 |
15567.2 |
NaN |
NaN |
vc |
|
https://www.soloprodottiitaliani.it/spec_colle... |
1 |
1 |
0.722513 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
156856757 |
35695443 |
35791 |
15000.0 |
NaN |
NaN |
vc |
8353245561168 |
https://rarest-wines.com/products/romanee-cont... |
1 |
1 |
0.000000 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
2161228 |
35695467 |
35791 |
11000.0 |
NaN |
NaN |
vc |
8362932011344 |
https://rarest-wines.com/products/romanee-cont... |
1 |
1 |
0.000000 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
1764869 |
34589484 |
34568 |
10240.0 |
NaN |
NaN |
vc |
8512976093452-44827393491212 |
https://johnstonfinewines.com/products/domaine... |
1 |
1 |
0.000000 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
1908340 |
31862266 |
18999 |
9031.0 |
NaN |
NaN |
vc |
MTH2204414 |
https://www.soloprodottiitaliani.it/spec_colle... |
1 |
1 |
0.722513 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
23828669 |
31862268 |
18999 |
9000.0 |
NaN |
NaN |
vc |
MTH2204416 |
https://www.soloprodottiitaliani.it/spec_colle... |
1 |
1 |
0.722513 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
150850054 |
35226807 |
18999 |
8879.0 |
NaN |
NaN |
vc |
MTH2400676 |
https://www.soloprodottiitaliani.it/spec_colle... |
1 |
1 |
0.722513 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
17476587 |
34589482 |
34568 |
8630.0 |
NaN |
NaN |
vc |
8512987070732-44827434090764 |
https://johnstonfinewines.com/products/domaine... |
1 |
1 |
0.000000 |
False |
[xom] |
0.0 |
EUR |
Euros |
€ |
None |
1 |
Fles (0.75 l) |
fles |
fles |
750 |
1204536 |
Nested JSON without json_normalize
what happen if in some cases and entire index is missing on your json? you can't use json_normalize, this is a great news because we can start thinking a little bit more to build our own json_ normalize function, I called it dict_ to_ table a very original name.
data is the dataframe where nested values are stored and specified with the parameter column, key is where the primary key of the wine is stored.
the following function loop through the nested dictionary and associate the wine key with function zip. the results is a list of dictionaries that can be easily read by pandas.DataFrame.
def dict_to_table(data, column, key):
collect = []
for i,j in zip(data[column], data[key]):
try:
for n in i:
n[key] = j
collect.append(n)
except:
pass
return collect
flavor_primary_keywords = pd.DataFrame(dict_to_table( data = flavor,
column = 'primary_keywords',
key = 'vintage.id' ))
flavor_primary_keywords.head(10)
id |
name |
count |
vintage.id |
348 |
raspberry |
34 |
156099154 |
93 |
cherry |
28 |
156099154 |
400 |
strawberry |
25 |
156099154 |
354 |
red fruit |
18 |
156099154 |
352 |
red cherry |
5 |
156099154 |
156 |
earthy |
23 |
1335502 |
284 |
mushroom |
9 |
1335502 |
384 |
smoke |
7 |
1335502 |
276 |
minerals |
5 |
1335502 |
430 |
truffle |
2 |
1335502 |
If you are reading this, it means that you went thorough the entire project, do no hesitate to contact me or schedule a meeting if you have any advice or interesting problems we can try to solve together!