Building Scalable AI Apps with PySpark & Hugging Face Transformers
For Developers

March 25, 2024

In the constantly shifting landscape of data-driven business, extracting valuable insights from extensive data reservoirs is both a challenge and a necessity. Meet the powerful pairing of PySpark and Hugging Face Transformers, two technologies that embrace the scale of big data and help engineering leaders unlock its latent potential. This article decodes the synergy between the two, shaping the future of scalable AI applications. Let’s dive deeply into PySpark and Hugging Face Transformers for building scalable AI that harnesses the power of data-driven decision-making.

What is PySpark?

PySpark is the Python API for Apache Spark, providing a smooth interface that enables developers to use Python and execute SQL-like commands for manipulating and analyzing data in a distributed processing setting. As its name implies, PySpark is a fusion of Python and Spark.

How is PySpark useful?

The utility of PySpark becomes evident, especially for enterprises grappling with terabytes of data within a robust big data framework like Apache Spark. Mere proficiency in Python and R frameworks falls short when dealing with extensive datasets necessitating manipulation within a distributed processing system—a prerequisite for most data-centric organizations. PySpark serves as an excellent entry point, offering a straightforward syntax easily grasped by those already acquainted with Python.

Companies are adopting PySpark because of its remarkable efficiency in processing large-scale data. Outpacing libraries such as Pandas and Dask in speed, PySpark excels in handling vast amounts of data, a task where others might falter. For instance, when confronted with petabytes of data, Pandas and Dask may stumble, but PySpark seamlessly manages the workload.

While Python code can be written on top of distributed systems like Hadoop, organizations often prefer Spark and its PySpark API due to their superior speed and real-time data handling capabilities. PySpark empowers developers to craft code that captures continuously updated data sources, a feat unattainable with Hadoop, which processes data solely in batch mode.

Although Apache Flink, with its PyFlink API, boasts faster performance than Spark, the latter's longer tenure in the industry and robust community support render it a more reliable choice. Additionally, PySpark is fault tolerant, enabling data recovery after a failure. And because Spark keeps working datasets in random access memory (RAM), it avoids repeated disk reads and writes, which makes iterative workloads dramatically faster.

What are Hugging Face Transformers?

Hugging Face Transformers is a widely used open-source toolkit for natural language processing. It includes various pre-trained models and tools that work seamlessly with state-of-the-art NLP models such as BERT, GPT-2, and RoBERTa. This toolkit is incredibly user-friendly, making it easy for researchers and programmers to leverage and explore pre-trained models for various tasks, including summarization, translation between languages, text classification, and more.
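
For example, the library's `pipeline` helper wraps a pre-trained model behind a one-line interface. This sketch assumes the `transformers` package (plus a backend such as PyTorch) is installed; the first call downloads whatever checkpoint the library currently ships as its default for the task:

```python
from transformers import pipeline

# Build a sentiment classifier from the library's default
# sentiment-analysis checkpoint (downloaded on first use)
classifier = pipeline("sentiment-analysis")

result = classifier("PySpark makes large-scale data processing painless.")[0]
print(result["label"], round(result["score"], 3))
```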

What are Transformers?

Transformers are an expansive storehouse of pre-trained, advanced models crafted for many applications, covering natural language processing (NLP), computer vision, and audio and speech processing tasks. Alongside its role as a host for Transformer models, the repository also embraces models beyond the Transformer paradigm. This includes up-to-date convolutional networks strategically crafted to address challenges in computer vision.

Useful Features of Transformers

  1. Parallelization and Scalability:

Transformers, the architecture introduced in the paper "Attention Is All You Need", allow for parallelization during training. Unlike recurrent neural networks (RNNs), which process sequences sequentially, transformers can process multiple tokens simultaneously. This parallelization significantly speeds up training and inference, making them highly scalable.

  2. Self-Attention Mechanism:

The core innovation in transformers lies in their self-attention mechanism. It enables the model to weigh the importance of different input tokens when generating an output token. This mechanism captures long-range dependencies and context, making transformers effective for machine translation and sentiment analysis tasks.
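
To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the matrices and dimensions are illustrative, not taken from any particular model:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sequence.

    x: (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings
w = [rng.normal(size=(8, 8)) for _ in range(3)]
out, attn = self_attention(x, *w)
```

Note that `scores` is computed for all token pairs in one matrix product, which is exactly why attention parallelizes where an RNN cannot.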

  3. Pre-trained Language Models:

Transformers are often pre-trained on massive amounts of text data. These pre-trained models learn rich representations of language, which can then be fine-tuned for specific downstream tasks. The Hugging Face Transformers library provides access to various pre-trained models, such as BERT, GPT, and RoBERTa.
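
Loading one of these models follows the same pattern regardless of architecture. The sketch below assumes the `transformers` package is installed and uses `bert-base-uncased` purely as an example checkpoint; the tokenizer converts raw text into the sub-word IDs the pre-trained model expects:

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with a pre-trained checkpoint
# (downloads a small vocabulary file on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pre-trained models see text as sub-word token IDs, not raw strings;
# BERT additionally brackets the sequence with [CLS] and [SEP]
encoding = tokenizer("Transformers learn rich representations of language.")
ids = encoding["input_ids"]
```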

  4. Transfer Learning:

Transfer learning is a key advantage of transformers. By leveraging them, developers can fine-tune pre-trained models on smaller, task-specific datasets. This approach saves computational resources and yields impressive results even with limited labelled data.

  5. Multimodal Applications:

Transformers are not limited to text data. They excel in handling multimodal inputs, such as combining text and images. For instance, the ViLBERT (Vision-and-Language BERT) model integrates visual and textual information, enabling applications like image captioning and visual question answering.

  6. Attention Visualization:

Transformers provide interpretability through attention maps. These maps reveal which input tokens contribute most to the output. Researchers and practitioners can analyze these attention patterns to understand model behaviour and identify biases.

Thus, transformers revolutionize natural language understanding by combining parallelization, self-attention, pre-training, and transfer learning. Their impact extends beyond text to various domains, making them indispensable tools for unlocking big data insights.

Case Study: Deriving Business Insights with PySpark

Hypothetical case study: uncovering business insights from an e-commerce dataset using PySpark

PySpark proves invaluable in extracting valuable business intelligence from extensive datasets. Let's explore a practical scenario where a retail enterprise aims to gain insights into customer purchasing patterns.

Data Source: (https://www.kaggle.com/datasets/carrie1/ecommerce-data)

Envision a retail company managing a vast dataset of customer transactions. This dataset encompasses details such as:

  • InvoiceNo: A distinctive identifier for each customer invoice.
  • StockCode: A unique identifier for each stocked item.
  • Description: The product acquired by the customer.
  • Quantity: The quantity of each item purchased in a single invoice.
  • InvoiceDate: The date of purchase.
  • UnitPrice: The cost of a single unit of each item.
  • CustomerID: An exclusive identifier assigned to each user.
  • Country: The origin country of the business transaction.

Problem Statement: The company aims to pinpoint patterns in customer behaviour to optimize marketing strategies and product offerings. Specifically, their focus includes:

1. Popular products and categories: Identifying consistently purchased products and categories helps understand customer preferences and guides stocking decisions.

2. Customer segmentation: Grouping customers based on their purchase history facilitates personalized marketing campaigns and promotions.

3. Seasonal trends: Analyzing purchase patterns across various seasons unveils fluctuations in demand for specific products.

Implementation using PySpark

To conduct large-scale analysis, we can lean on PySpark's distributed processing capabilities to perform aggregations and calculations across the entire dataset efficiently. Here's how PySpark can address each business inquiry:

1. Identifying Popular Products and Categories: Group the data by product or category and calculate total sales to pinpoint the best-selling items.

2. Customer Segmentation: Utilize PySpark's Machine Learning Library (MLlib) to cluster customers based on their purchase history, revealing groups with similar buying behaviours.

3. Analyzing Seasonal Trends: Group the data by month or quarter, then analyze sales figures to unveil seasonal variations in demand.

Experience the power of deriving data insights using the following Python code:

Step 1: Install the Library 

pip install pyspark

Step 2: Import necessary libraries

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, year, month

Step 3: Create SparkSession

spark = SparkSession.builder.appName("EcommerceInsights").getOrCreate()

Step 4: Read data from CSV with appropriate schema definition

df = spark.read.csv("ecommerce.csv", header=True, inferSchema=True)

Step 5: Data Preprocessing

from pyspark.sql.functions import to_timestamp

# Convert InvoiceDate (formatted like "12/1/2010 8:26" in this dataset)
# to timestamp type; a plain cast would yield nulls for this non-ISO format
df = df.withColumn("InvoiceDate", to_timestamp(col("InvoiceDate"), "M/d/yyyy H:mm"))

# Derive year and month for further analysis
df = df.withColumn("Year", year(col("InvoiceDate")))
df = df.withColumn("Month", month(col("InvoiceDate")))

Step 6: Deriving Insights from Data Using PySpark

Question 1: Popular Products and Categories (Total Revenue)

# Option 1: Group by StockCode (individual product)

product_sales = df.groupBy("StockCode") \
    .agg(sum(col("Quantity") * col("UnitPrice")).alias("TotalRevenue")) \
    .orderBy(col("TotalRevenue").desc())

# Option 2: Group by description (considering variants within product)

product_desc_sales = df.groupBy("Description") \
    .agg(sum(col("Quantity") * col("UnitPrice")).alias("TotalRevenue")) \
    .orderBy(col("TotalRevenue").desc())

Question 2: Customer Segmentation (Total Spent per Customer)

# Group by CustomerID and calculate total spending

customer_spending = df.groupBy("CustomerID") \
    .agg(sum(col("Quantity") * col("UnitPrice")).alias("TotalSpent"))

# Further analysis with clustering or machine learning models can be done here

Question 3: Seasonal Trends (Monthly Revenue)

# Group by Year and Month and calculate total revenue

monthly_revenue = df.groupBy("Year", "Month") \
    .agg(sum(col("Quantity") * col("UnitPrice")).alias("TotalRevenue")) \
    .orderBy(col("Year"), col("Month"))

Step 7: Display the results.

product_sales.show()  # Option 1 for Product Analysis

product_desc_sales.show()  # Option 2 for Description Analysis

customer_spending.show()  # Customer Analysis

monthly_revenue.show()  # Seasonal Trends

Step 8: Stop SparkSession

spark.stop()

Output of the Above Code

+---------+------------------+
|StockCode|      TotalRevenue|
+---------+------------------+
|      DOT|206245.47999999998|
|    22423|164762.19000000003|
|    47566| 98302.97999999997|
|   85123A| 97894.50000000001|
|   85099B| 92356.03000000003|
|    23084| 66756.58999999985|
|     POST|          66230.64|
|    22086|          63791.94|
|    84879| 58959.73000000005|
|    79321|53768.060000000005|
|    22502| 51041.37000000002|
|    22197|50987.469999999994|
|    23298| 42700.02000000001|
|    22386| 41619.66000000003|
|    23203| 40991.38000000004|
|    21137|          40596.96|
|    22720| 37413.43999999998|
|    23284| 36565.38999999998|
|    22960| 36116.08999999999|
|    82484|35859.270000000004|
+---------+------------------+
only showing top 20 rows

+--------------------+------------------+
|         Description|      TotalRevenue|
+--------------------+------------------+
|      DOTCOM POSTAGE|206245.47999999998|
|REGENCY CAKESTAND...|164762.19000000003|
|WHITE HANGING HEA...|          99668.47|
|       PARTY BUNTING| 98302.97999999997|
|JUMBO BAG RED RET...| 92356.03000000003|
|  RABBIT NIGHT LIGHT| 66756.58999999985|
|             POSTAGE|          66230.64|
|PAPER CHAIN KIT 5...|          63791.94|
|ASSORTED COLOUR B...| 58959.73000000005|
|       CHILLI LIGHTS|53768.060000000005|
|      SPOTTY BUNTING|          42065.32|
|JUMBO BAG PINK PO...| 41619.66000000003|
|BLACK RECORD COVE...|          40596.96|
|PICNIC BASKET WIC...|           39619.5|
|SET OF 3 CAKE TIN...| 37413.43999999998|
|DOORMAT KEEP CALM...| 36565.38999999998|
|JAM MAKING SET WI...| 36116.08999999999|
|WOOD BLACK BOARD ...|35859.270000000004|
|LUNCH BAG RED RET...| 34897.31000000001|
|      POPCORN HOLDER| 33969.45999999999|
+--------------------+------------------+
only showing top 20 rows

+----------+------------------+
|CustomerID|        TotalSpent|
+----------+------------------+
|     17420| 598.8299999999999|
|     16861|            151.65|
|     16503|1421.4299999999998|
|     15727|           5178.96|
|     17389|31300.079999999998|
|     15100|             635.1|
|     12471|          18740.92|
|     16916| 576.2599999999999|
|     17809| 4627.619999999999|
|     15738|4788.7699999999995|
|     17044|            897.43|
|     17223|426.78999999999996|
|     18043|            559.51|
|     13329| 740.3999999999999|
|     17950|            472.12|
|     15535| 459.8999999999999|
|     14713|2664.2599999999998|
|     16565|             173.7|
|     13225| 6083.040000000001|
|     14944|           5842.95|
+----------+------------------+
only showing top 20 rows

Conclusion

By leveraging PySpark's capabilities, businesses can unlock valuable insights from big data, ultimately leading to better decision-making, improved customer experiences, and increased profitability. This case study serves as a starting point, and PySpark's versatility allows for application across various industries with specific data sources and business challenges.