Data Collection

Scraping with Pycurl - How to Transform Json to a Pandas Dataframe

Projects
Data Collection

Requests vs Pycurl

When looking for python online data collection libraries (a.k.a. Scraping libraries) the main result is requests , however there is a much better library in my opinion called PyCurl, it is the python interface for libcurl which is basically the native library for curl . In this project I get immediately blocked when using requests, moreover the interesting part of the project is that we won't scrape an html code that we then have to parse with beautifulsoup, we will instead scrape data directly in a Json format. This gives us many advantages because Json is basically a dictionary that we can easily navigate, without need to learn a new library. Please have a look to this stackoverflow page if you are more interested in PyCurl vs Requests performances.

Introduction

As Data Scientist we are probably used to query our data using SQL, I have noticed a very common problem in Junior but also Senior analytics professional, they think that a table in a database is at the very beginning of the process, this is because it is their process within a bigger one; sometimes it is just a matter of point of view, for a Data Engineer a table in a database is at the end of the process!

If we want to have a broader view we need to look out of the Analytics bubble, this is the reason why i started this project.

When scraping data, the most important part is the website structure analysis, in fact I have spent many hours navigating the website to get as much info as possible. I would like to guide you through some of the encountered problems.

Javascript generated content

For complex websites you cannot directly scrape related HTML because most of the time there will be a JavaScript script generating informations. I have started considering Selenium to simulate a webdriver which would have fully loaded HTML info and then parse all this info with beautiful soup.

Selenium

The first problem of vivino website using Selenium was to fully load the wine list scrolling to the bottom, this require a scroll call with selenium until the end is reached, however by using this strategy you have to perform multiple calls and there may be the risk of being banned.

Find the most important JSON

When navigating the website I was not able to find a Json file with most of the info, this is because you get such file only after you start scrolling to the end of the first page. We now have a huge dictionary containing all details about a wine, even if some information in such dictionary are not displayed in the page you are looking.

Tables that will be created from this code

  1. vintage (this is the main table)
  2. region_most_used_grapes
  3. flavor
    • flavor_primary_keywords
    • flavor_secondary_keywords
  4. prices
  5. food
  6. grapes

Start of The Project

Import Libraries

import time
import pycurl
import certifi
from io import BytesIO
import json
import pandas as pd

Performing cURL Request


%%time
n = 0
pages_available = True
wines_collection = []

for i in range(100):
# Creating a buffer as the cURL is not allocating a buffer for the network response
    buffer = BytesIO()

# Initialising the curl object
    curl = pycurl.Curl()

# Setting the request URL, the user agent, and the buffer for the response
    url = f'https://www.vivino.com/api/explore/explore?country_code=IT¤cy_code=EUR&grape_filter=varietal&min_rating=1&order_by=price&order=desc&price_range_max=500+&price_range_min=0&wine_type_ids%5B%5D=1&wine_type_ids%5B%5D=2&wine_type_ids%5B%5D=3&wine_type_ids%5B%5D=24&wine_type_ids%5B%5D=7&wine_type_ids%5B%5D=4&page={n}&language=en'
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0")
    curl.setopt(pycurl.WRITEDATA, buffer)
    curl.setopt(pycurl.CAINFO, certifi.where())

# Performing the actual request
    curl.perform()

# Accessing the response code and adding 1 to n only if successful, this step is not necessary it allows you to check each step
    if curl.getinfo(pycurl.RESPONSE_CODE) == 200:
        #print(f'Success : {n} {curl.getinfo(pycurl.RESPONSE_CODE)}', )
        n +=1
    else:
        print(f'Not Success{curl.getinfo(pycurl.RESPONSE_CODE)}')

# Ending the session and freeing the resources
    curl.close()
    
#Convert Buffer from Byte to string to Dict
    page_dict = json.loads(buffer.getvalue().decode())

#page_dict is now a dictionaire containing three main dictionaries, the one we want is "explore_vintage", it contains another dict "matches"
#matches is a list of dict
    for i in page_dict['explore_vintage']['matches']:
        wines_collection.append(i)
    
    if n%10 == 0:
        print(len(wines_collection))
Output:
CPU times: user 4.74 s, sys: 541 ms, total: 5.28 s
Wall time: 1min 23s

Check Buffer and Navigate the Tree

This step is purely to show how to transformin buffer into a dictionary from Bytes -> String -> Dict and print each result


#buffer getvalue() now store informations we need in a bytes format
print(type(buffer.getvalue()))

#hence we will decode bytes into string
string = buffer.getvalue().decode()
print(type(string))

#at this stage we can  convert our string into a dictionary using json library
wines_info = json.loads(string)
print(type(wines_info))
Output:
<class 'bytes'>
<class 'str'>
<class 'dict'>

We can now have a look at our newly created dictionary structure


#We have three main keys
print(wines_info.keys())
#explore vintage has more keys
print(wines_info['explore_vintage'].keys())
#matches provides a list of values, let's see keys of the first value of this list
print(wines_info['explore_vintage']['matches'][0].keys())
print(wines_info['explore_vintage']['matches'][0]['vintage'].keys())
Output:
dict_keys(['explore_vintage', 'selected_filters', 'e'])
dict_keys(['market', 'records_matched', 'matches', 'bottle_type_errors', 'records'])
dict_keys(['vintage', 'price', 'prices'])
dict_keys(['id', 'seo_name', 'name', 'statistics', 'image', 'wine', 'year', 'grapes', 'has_valid_ratings'])

Build Pandas Dataframe

We will build a dataframe using one main pandas function, json_normalize, the first json flttening will be the easiest one, some other will require to add some parameter to json_normalize function, while at the very end we will use pure python code to manipulate our json.

Once you flatten Json and create a Dataframe with json_normalize, you can notice that there are still some columns containing a dictionary, this is because the json we just downloaded has many layers with which we can build multiple tables, each table can the be connected with primary keys like in a relational database.

vintage = pd.json_normalize(wines_collection)
vintage.head(2)
prices vintage.id vintage.seo_name vintage.name vintage.statistics.status vintage.statistics.ratings_count vintage.statistics.ratings_average vintage.statistics.labels_count vintage.statistics.wine_ratings_count vintage.statistics.wine_ratings_average vintage.statistics.wine_status vintage.image.location vintage.image.variations.bottle_large vintage.image.variations.bottle_medium vintage.image.variations.bottle_medium_square vintage.image.variations.bottle_small vintage.image.variations.bottle_small_square vintage.image.variations.label vintage.image.variations.label_large vintage.image.variations.label_medium vintage.image.variations.label_medium_square vintage.image.variations.label_small_square vintage.image.variations.large vintage.image.variations.medium vintage.image.variations.medium_square vintage.image.variations.small_square vintage.wine.id vintage.wine.name vintage.wine.seo_name vintage.wine.type_id vintage.wine.vintage_type vintage.wine.is_natural vintage.wine.region.id vintage.wine.region.name vintage.wine.region.name_en vintage.wine.region.seo_name vintage.wine.region.country.code vintage.wine.region.country.name vintage.wine.region.country.native_name vintage.wine.region.country.seo_name vintage.wine.region.country.currency.code vintage.wine.region.country.currency.name vintage.wine.region.country.currency.prefix vintage.wine.region.country.currency.suffix vintage.wine.region.country.regions_count vintage.wine.region.country.users_count vintage.wine.region.country.wines_count vintage.wine.region.country.wineries_count vintage.wine.region.country.most_used_grapes vintage.wine.region.background_image vintage.wine.winery.name vintage.wine.winery.seo_name vintage.wine.winery.status vintage.wine.winery.background_image vintage.wine.taste.structure.acidity vintage.wine.taste.structure.fizziness vintage.wine.taste.structure.intensity vintage.wine.taste.structure.sweetness vintage.wine.taste.structure.tannin vintage.wine.taste.structure.user_structure_count vintage.wine.taste.structure.calculated_structure_count vintage.wine.taste.flavor vintage.wine.statistics vintage.wine.style.id vintage.wine.style.seo_name vintage.wine.style.regional_name vintage.wine.style.varietal_name vintage.wine.style.name vintage.wine.style.image vintage.wine.style.background_image vintage.wine.style.interesting_facts vintage.wine.style.body vintage.wine.style.body_description vintage.wine.style.acidity vintage.wine.style.acidity_description vintage.wine.style.country.code vintage.wine.style.country.name vintage.wine.style.country.native_name vintage.wine.style.country.seo_name vintage.wine.style.country.currency.code vintage.wine.style.country.currency.name vintage.wine.style.country.currency.prefix vintage.wine.style.country.currency.suffix vintage.wine.style.country.regions_count vintage.wine.style.country.users_count vintage.wine.style.country.wines_count vintage.wine.style.country.wineries_count vintage.wine.style.country.most_used_grapes vintage.wine.style.wine_type_id vintage.wine.style.food vintage.wine.style.grapes vintage.wine.style.region.id vintage.wine.style.region.name vintage.wine.style.region.name_en vintage.wine.style.region.seo_name vintage.wine.style.region.country.code vintage.wine.style.region.country.name vintage.wine.style.region.country.seo_name vintage.wine.style.region.country.currency.code vintage.wine.style.region.country.currency.name vintage.wine.style.region.country.currency.prefix vintage.wine.style.region.country.currency.suffix vintage.wine.style.region.country.regions_count vintage.wine.style.region.country.users_count vintage.wine.style.region.country.wines_count vintage.wine.style.region.country.wineries_count vintage.wine.style.region.country.most_used_grapes vintage.wine.style.region.parent_id vintage.wine.style.region.background_image.location vintage.wine.style.region.background_image.variations.large vintage.wine.style.region.background_image.variations.medium vintage.wine.style.region.statistics.wineries_count vintage.wine.style.region.statistics.wines_count vintage.wine.style.region.statistics.sub_regions_count vintage.wine.style.region.statistics.parent_regions_count vintage.wine.style.parent_style_id vintage.wine.style.hidden vintage.wine.style.statistics.wines_count vintage.wine.style.statistics.aggregated_wines_count vintage.wine.style.vintage_mask vintage.wine.style.baseline_structure.acidity vintage.wine.style.baseline_structure.fizziness vintage.wine.style.baseline_structure.intensity vintage.wine.style.baseline_structure.sweetness vintage.wine.style.baseline_structure.tannin vintage.wine.has_valid_ratings vintage.year vintage.grapes vintage.has_valid_ratings price.id price.merchant_id price.amount price.discounted_from price.discount_percent price.type price.sku price.url price.visibility price.bottle_type_id price.currency.code price.currency.name price.currency.prefix price.currency.suffix price.xms price.xmphs price.price_group_id price.vfm_score price.bottle_type.id price.bottle_type.name price.bottle_type.short_name price.bottle_type.short_name_plural price.bottle_type.volume_ml vintage.wine.taste.structure vintage.wine.style vintage.wine.region.background_image.location vintage.wine.region.background_image.variations.large vintage.wine.region.background_image.variations.medium vintage.wine.winery.background_image.location vintage.wine.winery.background_image.variations.large vintage.wine.winery.background_image.variations.medium vintage.wine.winery.background_image.variations.small vintage.top_list_rankings vintage.wine.style.region.background_image vintage.wine.style.region
[{'id': 35134561,   'merchant_id': 18999, 'amoun... 156099154 domaine-de-la-romanee-conti-romanee-conti-gran... Domaine   de La Romanée-Conti Romanée-Conti Gran... Normal 47 4.7 715 9766 4.7 //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... 83912 Romanée-Conti   Grand Cru romanee-conti-grand-cru 1 0 False 2228 Romanée-Conti   Grand Cru Romanée-Conti romanee-conti fr Frankrijk France france EUR Euros None 1307 6550621 433651 68039 [{'id':   14, 'name': 'Pinot Noir', 'seo_name': ... NaN Domaine   de La Romanée-Conti domaine-de-la-romanee-conti 0 NaN 4.101636 NaN 3.041284 1.959191 2.503016 354.0 159.0 [{'group':   'red_fruit', 'stats': {'count': 87,... None 283.0 burgundy-cote-de-nuits-red Bourgogne Rode   wijn uit Côte de Nuits Bourgogne   Côte de Nuits Rood NaN NaN None 3.0 Medium-bodied 3.0 Hoog fr Frankrijk France france EUR Euros NaN 1307.0 6550621.0 433651.0 68039.0 [{'id':   14, 'name': 'Pinot Noir', 'seo_name': ... 1.0 [{'id':   4, 'name': 'Rundvlees', 'weight': 0.5,... [{'id':   14, 'name': 'Pinot noir', 'seo_name': ... 408.0 Côte de Nuits cote-de-nuits fr Frankrijk france EUR Euros NaN 1307.0 6550621.0 433651.0 68039.0 [{'id':   14, 'name': 'Pinot Noir', 'seo_name': ... 385.0 //images.vivino.com/regions/backgrounds/oTEcWw... //thumbs.vivino.com/region_backgrounds/oTEcWwU... //thumbs.vivino.com/region_backgrounds/oTEcWwU... 766.0 7329.0 11.0 1.0 23.0 False 13612.0 13612.0 1< 4.5 NaN 3.0 1.0 2.5 False 2018 None True 35134561 18999 50960.0 NaN NaN vc MTH2204156 https://www.soloprodottiitaliani.it/spec_colle... 1 1 EUR Euros None 0.722997 False [xom] 0.0 1 Fles   (0.75 l) fles fles 750 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[{'id': 35695434,   'merchant_id': 35791, 'amoun... 1335502 domaine-de-la-romanee-conti-romanee-conti-gran... Domaine   de La Romanée-Conti Romanée-Conti Gran... Normal 134 4.6 1094 9766 4.7 //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... //images.vivino.com/thumbs/SzZEjTsMRlWz3yDyCIS... 83912 Romanée-Conti   Grand Cru romanee-conti-grand-cru 1 0 False 2228 Romanée-Conti   Grand Cru Romanée-Conti romanee-conti fr Frankrijk France france EUR Euros None 1307 6550621 433651 68039 [{'id':   14, 'name': 'Pinot Noir', 'seo_name': ... NaN 11610 Domaine   de La Romanée-Conti domaine-de-la-romanee-conti 0 NaN 4.101636 NaN 3.041284 1.959191 2.503016 354.0 159.0 [{'group':   'red_fruit', 'stats': {'count': 87,... None 283.0 burgundy-cote-de-nuits-red Bourgogne Rode   wijn uit Côte de Nuits Bourgogne   Côte de Nuits Rood NaN NaN None None None 3.0 Medium-bodied 3.0 Hoog fr Frankrijk France france EUR Euros NaN 1307.0 6550621.0 433651.0 68039.0 [{'id':   14, 'name': 'Pinot Noir', 'seo_name': ... 1.0 [{'id':   4, 'name': 'Rundvlees', 'weight': 0.5,... [{'id':   14, 'name': 'Pinot noir', 'seo_name': ... 408.0 Côte de Nuits cote-de-nuits fr Frankrijk France france EUR Euros NaN 1307.0 6550621.0 433651.0 68039.0 [{'id':   14, 'name': 'Pinot Noir', 'seo_name': ... 385.0 //images.vivino.com/regions/backgrounds/oTEcWw... //thumbs.vivino.com/region_backgrounds/oTEcWwU... //thumbs.vivino.com/region_backgrounds/oTEcWwU... 766.0 7329.0 11.0 1.0 23.0 False 13612.0 13612.0 1< 4.5 NaN 3.0 1.0 2.5 False 1982 None True 35695434 35791 18500.0 NaN NaN vc 8351718768976 https://rarest-wines.com/products/romanee-cont... 1 1 EUR Euros None 0.000000 False [xom] 0.0 1 Fles   (0.75 l) fles fles 750 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Nested JSON

From the table above, you may have noticed that the scraped Json is not fully transformed into a table, indeed we still have some list of dictionaries nested in our created dataframe. Check the columns "Prices", each wine may have multiple prices offered by different vendor, this is the reason why such variable is more complex than the others. There are other variables like this, from such variables we can create other tables that can be interpeted as dimensional tables.

It seems that from our simple Json we are able to build a proper schema.

Before we have used json_normalize just by giving json file as imput parameter, when dealing with nested json we can pass a list to the record_ path argument, such list represents nested json keys in order until the last one (where what you need is store) is reached.

Another parameter called meta can be provided, in this case we specify the location of a single value related to that record, in this case i will use the wine id so that we will be able in the future to join back all the information related to a wine.

Transform: vintage.wine.region.country.most_used_grapes

note how most_used_grapes is the last value of the list in record_ path

region_most_used_grapes = pd.json_normalize(wines_collection,
                                     record_path = ['vintage', 'wine', 'region', 'country', 'most_used_grapes'],
                                     record_prefix = 'vintage_wine_region_country_MostUsedGrapes_',
                                     meta = [['vintage', 'id']])
region_most_used_grapes.head(10)
vintage_wine_region_country_MostUsedGrapes_id vintage_wine_region_country_MostUsedGrapes_name vintage_wine_region_country_MostUsedGrapes_seo_name vintage_wine_region_country_MostUsedGrapes_has_detailed_info vintage_wine_region_country_MostUsedGrapes_wines_count vintage_wine_region_country_MostUsedGrapes_parent_grape_id vintage.id
14 Pinot   Noir pinot-noir True 572334 NaN 156099154
5 Chardonnay chardonnay True 604208 NaN 156099154
10 Merlot merlot True 566719 NaN 156099154
14 Pinot   Noir pinot-noir True 572334 NaN 1335502
5 Chardonnay chardonnay True 604208 NaN 1335502
10 Merlot merlot True 566719 NaN 1335502
15 Riesling riesling True 262136 NaN 156856757
299 Spätburgunder spatburgunder False 50164 14.0 156856757
355 Weissburgunder weissburgunder False 33464 36.0 156856757
14 Pinot   Noir pinot-noir True 572334 NaN 2161228

Transform: vintage.wine.tase.flavor

flavor contains another dictionary that we will transform later without using json_normalize but pure python

flavor = pd.json_normalize(wines_collection,
                                     record_path = ['vintage', 'wine', 'taste', 'flavor'], 
                                     meta = [['vintage', 'id']])
flavor.head(10)
group primary_keywords secondary_keywords stats.count stats.score vintage.id
red_fruit [{'id':   348, 'name': 'raspberry', 'count': 34}... [{'id':   38, 'name': 'black cherry', 'count': 1... 87 11822 156099154
earth [{'id':   242, 'name': 'leather', 'count': 33}, ... [{'id':   134, 'name': 'dark chocolate', 'count'... 73 8582 156099154
oak [{'id':   292, 'name': 'oak', 'count': 11}, {'id... [{'id':   242, 'name': 'leather', 'count': 33}, ... 69 4570 156099154
black_fruit [{'id':   334, 'name': 'plum', 'count': 14}, {'i... [{'id':   341, 'name': 'prune', 'count': 2}] 49 6034 156099154
non_oak [{'id':   62, 'name': 'brioche', 'count': 1}, {'... [{'id':   117, 'name': 'coffee', 'count': 7}, {'... 30 603 156099154
spices [{'id':   320, 'name': 'pepper', 'count': 6}, {'... [{'id':   434, 'name': 'vanilla', 'count': 2}, {... 21 2195 156099154
floral [{'id':   436, 'name': 'violet', 'count': 5}, {'... [{'id':   381, 'name': 'savory', 'count': 2}] 19 2050 156099154
microbio [{'id':   238, 'name': 'lager', 'count': 1}, {'i... [{'id':   284, 'name': 'mushroom', 'count': 9}, ... 12 700 156099154
citrus_fruit [{'id':   298, 'name': 'orange', 'count': 3}, {'... NaN 7 700 156099154
dried_fruit [{'id':   341, 'name': 'prune', 'count': 2}, {'i... [{'id':   229, 'name': 'jam', 'count': 1}] 5 394 156099154

Transform: prices

prices = pd.json_normalize(wines_collection,
                                     record_path = 'prices',
                                     record_prefix = 'prices_', meta = [['vintage', 'id']])
prices.head(10)
prices_id prices_merchant_id prices_amount prices_discounted_from prices_discount_percent prices_type prices_sku prices_url prices_visibility prices_bottle_type_id prices_xms prices_xmphs prices_price_group_id prices_vfm_score prices_currency.code prices_currency.name prices_currency.prefix prices_currency.suffix prices_bottle_type.id prices_bottle_type.name prices_bottle_type.short_name prices_bottle_type.short_name_plural prices_bottle_type.volume_ml vintage.id
35134561 18999 50960.0 NaN NaN vc MTH2204156 https://www.soloprodottiitaliani.it/spec_colle... 1 1 0.722997 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 156099154
35695434 35791 18500.0 NaN NaN vc 8351718768976 https://rarest-wines.com/products/romanee-cont... 1 1 0.000000 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 1335502
34993404 18999 15567.2 NaN NaN vc https://www.soloprodottiitaliani.it/spec_colle... 1 1 0.722513 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 156856757
35695443 35791 15000.0 NaN NaN vc 8353245561168 https://rarest-wines.com/products/romanee-cont... 1 1 0.000000 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 2161228
35695467 35791 11000.0 NaN NaN vc 8362932011344 https://rarest-wines.com/products/romanee-cont... 1 1 0.000000 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 1764869
34589484 34568 10240.0 NaN NaN vc 8512976093452-44827393491212 https://johnstonfinewines.com/products/domaine... 1 1 0.000000 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 1908340
31862266 18999 9031.0 NaN NaN vc MTH2204414 https://www.soloprodottiitaliani.it/spec_colle... 1 1 0.722513 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 23828669
31862268 18999 9000.0 NaN NaN vc MTH2204416 https://www.soloprodottiitaliani.it/spec_colle... 1 1 0.722513 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 150850054
35226807 18999 8879.0 NaN NaN vc MTH2400676 https://www.soloprodottiitaliani.it/spec_colle... 1 1 0.722513 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 17476587
34589482 34568 8630.0 NaN NaN vc 8512987070732-44827434090764 https://johnstonfinewines.com/products/domaine... 1 1 0.000000 False [xom] 0.0 EUR Euros None 1 Fles   (0.75 l) fles fles 750 1204536

Nested JSON without json_normalize

what happen if in some cases and entire index is missing on your json? you can't use json_normalize, this is a great news because we can start thinking a little bit more to build our own json_ normalize function, I called it dict_ to_ table a very original name.

data is the dataframe where nested values are stored and specified with the parameter column, key is where the primary key of the wine is stored.

the following function loop through the nested dictionary and associate the wine key with function zip. the results is a list of dictionaries that can be easily read by pandas.DataFrame.

def dict_to_table(data, column, key):
    collect = []
    for i,j in zip(data[column], data[key]):
        try:
            for n in i:
                n[key] = j
                collect.append(n)
        except:
            pass
    return collect
flavor_primary_keywords = pd.DataFrame(dict_to_table( data = flavor, 
		column = 'primary_keywords', 
		key = 'vintage.id' ))
flavor_primary_keywords.head(10)
id name count vintage.id
348 raspberry 34 156099154
93 cherry 28 156099154
400 strawberry 25 156099154
354 red   fruit 18 156099154
352 red   cherry 5 156099154
156 earthy 23 1335502
284 mushroom 9 1335502
384 smoke 7 1335502
276 minerals 5 1335502
430 truffle 2 1335502

If you are reading this, it means that you went thorough the entire project, do no hesitate to contact me or schedule a meeting if you have any advice or interesting problems we can try to solve together!

Articles Updates

04

There are currently no Active Newsletter

By clicking Sign Up you're confirming that you agree with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.