We start by importing a few libraries and a secret YouTube API key. If you don't have an API key, you can create one by following this guide.
import requests
import json
import polars as pl
from my_sk import my_key

from youtube_transcript_api import YouTubeTranscriptApi
Next, we'll define variables to help us extract video data from the YouTube API. Here, I specify the ID of my YouTube channel and the API URL, initialize page_token, and create a list for storing video data.
# define channel ID
channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA'

# define url for API
url = 'https://www.googleapis.com/youtube/v3/search'

# initialize page token
page_token = None

# initialize list to store video data
video_record_list = []
The next chunk of code might look scary, so I'll explain what's happening first. We'll perform GET requests against YouTube's search API. This is just like searching for videos on YouTube, but instead of using the UI, we perform the searches programmatically.
Since search results are limited to 50 per page, we need to perform searches repeatedly, page by page, to return every video that matches the search criteria. Here's what that looks like in Python code.
# extract video data across multiple search result pages
while page_token != 0:

    # define parameters for API call
    params = {'key': my_key, 'channelId': channel_id,
              'part': ["snippet", "id"], 'order': "date",
              'maxResults': 50, 'pageToken': page_token}

    # make get request
    response = requests.get(url, params=params)

    # append video data from page results to list
    video_record_list += getVideoRecords(response)

    try:
        # grab next page token
        page_token = json.loads(response.text)['nextPageToken']
    except:
        # if no next page token kill while loop
        page_token = 0
getVideoRecords() is a user-defined function that extracts the relevant information from an API response.
# extract video data from a single search result page
def getVideoRecords(response: requests.models.Response) -> list:
    """
        Function to extract YouTube video data from GET request response
    """

    # initialize list to store video data from page results
    video_record_list = []

    for raw_item in json.loads(response.text)['items']:

        # only execute for youtube videos
        if raw_item['id']['kind'] != "youtube#video":
            continue

        # extract relevant data
        video_record = {}
        video_record['video_id'] = raw_item['id']['videoId']
        video_record['datetime'] = raw_item['snippet']['publishedAt']
        video_record['title'] = raw_item['snippet']['title']

        # append record to list
        video_record_list.append(video_record)

    return video_record_list
Now that we have information about all my YouTube videos, let's extract the automatically generated captions. To make the video IDs easier to access, I'll store the video data in a Polars dataframe.
# store data in polars dataframe
df = pl.DataFrame(video_record_list)
print(df.head())
To pull the video captions, I'll use the youtube_transcript_api Python library. I'll loop through each video ID in the dataframe and extract the associated transcript.
# initialize list to store video captions
transcript_text_list = []

# loop through each row of dataframe
for i in range(len(df)):

    # try to extract captions
    try:
        # get transcript
        transcript = YouTubeTranscriptApi.get_transcript(df['video_id'][i])

        # extract text transcript
        transcript_text = extract_text(transcript)
    # if no captions available set as n/a
    except:
        transcript_text = "n/a"

    # append transcript text to list
    transcript_text_list.append(transcript_text)
Again, I use a user-defined function called extract_text() to extract the necessary information from the API response.
def extract_text(transcript: list) -> str:
    """
        Function to extract text from transcript dictionary
    """
    text_list = [transcript[i]['text'] for i in range(len(transcript))]
    return ' '.join(text_list)
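To make the structure concrete, here's a small illustration with a made-up transcript snippet (the API returns a list of dicts with 'text', 'start', and 'duration' keys, and extract_text() simply concatenates the text fields):
# illustrative example with a made-up two-entry transcript
sample_transcript = [{'text': 'welcome back', 'start': 0.0, 'duration': 1.2},
                     {'text': 'to the channel', 'start': 1.2, 'duration': 1.5}]
print(extract_text(sample_transcript))  # welcome back to the channel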
Then we can add the transcripts for each video to the dataframe.
# add transcripts to dataframe
df = df.with_columns(pl.Series(name="transcript", values=transcript_text_list))
print(df.head())
With the data extracted, we can transform it so it's ready for the downstream use case. This requires some exploratory data analysis (EDA).
Handling duplicates
A good starting point for EDA is to examine the number of unique rows and elements in each column. Here, we expect every row to be uniquely identified by the video_id. Additionally, no column should have repeating elements, apart from videos for which no transcript was available, which we set to "n/a".
Here's some code to probe that information. We can see from the output that the data match our expectations.
# shape + unique values
print("shape:", df.shape)
print("n unique rows:", df.n_unique())
for j in range(df.shape[1]):
    print("n unique elements (" + df.columns[j] + "):", df[:, j].n_unique())

### output
# shape: (84, 4)
# n unique rows: 84
# n unique elements (video_id): 84
# n unique elements (datetime): 84
# n unique elements (title): 84
# n unique elements (transcript): 82
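No cleanup is needed here, but if duplicate rows had turned up, they would be easy to drop in Polars. A minimal sketch, assuming we want to keep one row per video_id:
# minimal sketch: drop duplicate rows, keeping one row per video_id
# (not needed here, since all 84 video_ids are already unique)
df = df.unique(subset=['video_id'], keep='first')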
Check dtypes
Next, we can examine the data types of each column. From the printed dataframe above, we can see that all of the columns are strings.
While this is appropriate for video_id, title, and transcript, it is not a good choice for the datetime column. We can change its type in the following way.
# change datetime to Datetime dtype
df = df.with_columns(pl.col('datetime').cast(pl.Datetime))
print(df.head())
Handling special characters
Since we're working with text data, it's important to look out for special character strings. This requires a bit of manual skimming of the text, but after a few minutes, I found 2 special cases: the HTML character codes &#39; → ' and &amp; → &.
In the code below, I replace these strings with the appropriate characters and change "sha" to "Shaw".
# list all special strings and their replacements
special_strings = ['&#39;', '&amp;', 'sha ']
special_string_replacements = ["'", "&", "Shaw "]

# replace each special string appearing in title and transcript columns
for i in range(len(special_strings)):
    df = df.with_columns(df['title'].str.replace(special_strings[i],
                            special_string_replacements[i]).alias('title'))
    df = df.with_columns(df['transcript'].str.replace(special_strings[i],
                            special_string_replacements[i]).alias('transcript'))
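One thing to keep in mind is that Polars' str.replace() only swaps the first match in each string, so str.replace_all() may be worth considering if an entity can appear more than once. As an optional sanity check, we can count how many rows still contain HTML-entity-like patterns:
# optional check: count rows that still contain HTML-entity-like patterns
for col in ['title', 'transcript']:
    n_left = df.filter(pl.col(col).str.contains('&#|&amp;')).height
    print("rows with leftover entities in " + col + ":", n_left)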
Since the dataset here is very small (84 rows and 4 columns, ~900k characters), we can store the data directly in the project directory. This can be done in a single line of code using the write_parquet() method in Polars. The final file size is 341 KB.
# write data to file
df.write_parquet('data/video-transcripts.parquet')
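As a final sanity check (not part of the pipeline itself), the file can be read back to confirm it round-trips cleanly:
# optional check: read the file back and confirm shape and dtypes
df_check = pl.read_parquet('data/video-transcripts.parquet')
print(df_check.shape)
print(df_check.schema)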
Here, we discussed the basics of building data pipelines in the context of Full Stack Data Science and walked through a concrete example using real-world data.
In the next article of this series, we'll continue down the data science tech stack and discuss how we can use this data pipeline to develop a semantic search system for my YouTube videos.
More from this series 👇