diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.ipynb b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.ipynb new file mode 100644 index 0000000000..2446f11fc8 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.ipynb @@ -0,0 +1,1386 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "muD_5mFUYB-x" + }, + "source": [ + "# Job recommendation system\n", + "\n", + "The code sample contains the following parts:\n", + "\n", + "1. Data exploration and visualization\n", + "2. Data cleaning/pre-processing\n", + "3. Fake job postings identification and removal\n", + "4. Job recommendation by showing the most similar job postings\n", + "\n", + "The scenario is that someone wants to find the best posting for themselves. They have collected the data, but he is not sure if all the data is real. Therefore, based on a trained model, as in this sample, they identify with a high degree of accuracy which postings are real, and it is among them that they choose the best ad for themselves.\n", + "\n", + "For simplicity, only one dataset will be used within this code, but the process using one dataset is not significantly different from the one described earlier.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GTu2WLQmZU-L" + }, + "source": [ + "## Data exploration and visualization\n", + "\n", + "For the purpose of this code sample we will use Real or Fake: Fake Job Postings dataset available over HuggingFace API. In this first part we will focus on data exploration and visualization. In standard end-to-end workload it is the first step. Engineer needs to first know the data to be able to work on it and prepare solution that will utilize dataset the best.\n", + "\n", + "Lest start with loading the dataset. We are using datasets library to do that." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "saMOoStVs0-s", + "outputId": "ba4623b9-0533-4062-b6b0-01e96bd4de39" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"victor/real-or-fake-fake-jobposting-prediction\")\n", + "dataset = dataset['train']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To better analyze and understand the data we are transferring it to pandas DataFrame, so we are able to take benefit from all pandas data transformations. Pandas library provides multiple useful functions for data manipulation so it is usual choice at this stage of machine learning or deep learning project.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rRkolJQKtAzt" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = dataset.to_pandas()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see 5 first and 5 last rows in the dataset we are working on." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 556
+ },
+ "id": "WYGIRBUJSl3N",
+ "outputId": "ccd4abaf-1b4d-4fbd-85c8-54408c4f9f8a"
+ },
+ "outputs": [],
+ "source": [
+ "df.tail()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let's print a concise summary of the dataset. This way we will see all the column names, the number of rows and the type of every column. It is a great overview of the features of the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "UtxA6fmaSrQ8",
+ "outputId": "e8a1ce15-88e8-487c-d05e-74c024aca994"
+ },
+ "outputs": [],
+ "source": [
+ "df.info()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "At this point it is a good idea to make sure our dataset doesn't contain any data duplication that could impact the results of our future system. To do that we first need to remove the `job_id` column. It contains a unique number for each job posting, so even if the rest of the data is the same between 2 postings it would make them look different."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 556
+ },
+ "id": "f4LJCdKHStca",
+ "outputId": "b1db61e1-a909-463b-d369-b38c2349cba6"
+ },
+ "outputs": [],
+ "source": [
+ "# Drop the 'job_id' column\n",
+ "df = df.drop(columns=['job_id'])\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And now, the actual duplicates removal. We first print the number of duplicates in our dataset, then remove them using the `drop_duplicates` method, and after this operation print the number of duplicates again. If everything works as expected, after duplicates removal we should print `0` as the current number of duplicates in the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Ow8SgJg2vJkB",
+ "outputId": "9a6050bf-f4bf-4b17-85a1-8d2980cd77ee"
+ },
+ "outputs": [],
+ "source": [
+ "# let's make sure that there are no duplicated jobs\n",
+ "\n",
+ "print(df.duplicated().sum())\n",
+ "df = df.drop_duplicates()\n",
+ "print(df.duplicated().sum())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tcpcjR8UUQCJ"
+ },
+ "source": [
+ "Now we can visualize the data from the dataset. First let's visualize the data as if it were all real, and later, for the purposes of fake data detection, we will also visualize it split into fake and real postings.\n",
+ "\n",
+ "When working with text data it can be challenging to visualize it. Thankfully, there is a `wordcloud` library that shows common words in the analyzed texts. The bigger a word is, the more often it appears in the text. Wordclouds allow us to quickly identify the most important topics and themes in a large text dataset and also explore patterns and trends in textual data.\n",
+ "\n",
+ "In our example, we will create a wordcloud for job titles to get a high-level overview of the job postings we are working with."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 544 + }, + "id": "c0jsAvejvzQ5", + "outputId": "7622e54f-6814-47e1-d9c6-4ee13173b4f4" + }, + "outputs": [], + "source": [ + "from wordcloud import WordCloud # module to print word cloud\n", + "from matplotlib import pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# On the basis of Job Titles form word cloud\n", + "job_titles_text = ' '.join(df['title'])\n", + "wordcloud = WordCloud(width=800, height=400, background_color='white').generate(job_titles_text)\n", + "\n", + "# Plotting Word Cloud\n", + "plt.figure(figsize=(10, 6))\n", + "plt.imshow(wordcloud, interpolation='bilinear')\n", + "plt.title('Job Titles')\n", + "plt.axis('off')\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Different possibility to get some information from this type of dataset is by showing top-n most common values in given column or distribution of the values int his column.\n", + "Let's show top 10 most common job titles and compare this result with previously showed wordcould." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "0Ut0qo0ywv3_", + "outputId": "705fbbf0-4dc0-4ee1-d821-edccaff78a85" + }, + "outputs": [], + "source": [ + "# Get Count of job title\n", + "job_title_counts = df['title'].value_counts()\n", + "\n", + "# Plotting a bar chart for the top 10 most common job titles\n", + "top_job_titles = job_title_counts.head(10)\n", + "plt.figure(figsize=(10, 6))\n", + "top_job_titles.sort_values().plot(kind='barh')\n", + "plt.title('Top 10 Most Common Job Titles')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Job Titles')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can do the same for different columns, as `employment_type`, `required_experience`, `telecommuting`, `has_company_logo` and `has_questions`. These should give us reale good overview of different parts of our dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "OaBkEWNLxkqK", + "outputId": "efbd9955-5630-4fdb-a6dd-f4ffe4b0a7d8" + }, + "outputs": [], + "source": [ + "# Count the occurrences of each work type\n", + "work_type_counts = df['employment_type'].value_counts()\n", + "\n", + "# Plotting the distribution of work types\n", + "plt.figure(figsize=(8, 6))\n", + "work_type_counts.sort_values().plot(kind='barh')\n", + "plt.title('Distribution of Work Types Offered by Jobs')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Work Types')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "5uTBPGXgyZEV", + "outputId": "d6c76b5f-25ce-4730-f849-f881315ca883" + }, + "outputs": [], + "source": [ + "# Count the occurrences of required experience types\n", + "work_type_counts = df['required_experience'].value_counts()\n", + "\n", + "# Plotting the distribution of work types\n", + "plt.figure(figsize=(8, 6))\n", + "work_type_counts.sort_values().plot(kind='barh')\n", + "plt.title('Distribution of Required Experience by Jobs')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Required Experience')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For employment_type and required_experience we also created matrix to see if there is any corelation between those two. To visualize it we created heatmap. If you think that some of the parameters can be related, creating similar heatmap can be a good idea." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 696 + }, + "id": "nonO2cHR1I-t", + "outputId": "3101b8b2-cf0a-413b-b0aa-eb2a3a96a582" + }, + "outputs": [], + "source": [ + "from matplotlib import pyplot as plt\n", + "import seaborn as sns\n", + "import pandas as pd\n", + "\n", + "plt.subplots(figsize=(8, 8))\n", + "df_2dhist = pd.DataFrame({\n", + " x_label: grp['required_experience'].value_counts()\n", + " for x_label, grp in df.groupby('employment_type')\n", + "})\n", + "sns.heatmap(df_2dhist, cmap='viridis')\n", + "plt.xlabel('employment_type')\n", + "_ = plt.ylabel('required_experience')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "mXdpeQFJ1VMu", + "outputId": "eb9a893f-5087-4dad-ceca-48a1dfeb0b02" + }, + "outputs": [], + "source": [ + "# Count the occurrences of unique values in the 'telecommuting' column\n", + "telecommuting_counts = df['telecommuting'].value_counts()\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "telecommuting_counts.sort_values().plot(kind='barh')\n", + "plt.title('Counts of telecommuting vs Non-telecommuting')\n", + "plt.xlabel('count')\n", + "plt.ylabel('telecommuting')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "8kEu4IKcVmSV", + "outputId": "94ae873f-9178-4c63-e855-d677f135e552" + }, + "outputs": [], + "source": [ + "has_company_logo_counts = df['has_company_logo'].value_counts()\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "has_company_logo_counts.sort_values().plot(kind='barh')\n", + "plt.ylabel('has_company_logo')\n", + "plt.xlabel('Count')\n", + 
"plt.title('Counts of With_Logo vs Without_Logo')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "Esv8b51EVvxx", + "outputId": "40355e3f-fc6b-4b16-d459-922cfede2f71" + }, + "outputs": [], + "source": [ + "has_questions_counts = df['has_questions'].value_counts()\n", + "\n", + "# Plot the counts\n", + "plt.figure(figsize=(8, 6))\n", + "has_questions_counts.sort_values().plot(kind='barh')\n", + "plt.ylabel('has_questions')\n", + "plt.xlabel('Count')\n", + "plt.title('Counts Questions vs NO_Questions')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the job recommendations point of view the salary and location can be really important parameters to take into consideration. In given dataset we have salary ranges available so there is no need for additional data processing rather than removal of empty ranges but if the dataset you're working on has specific values, consider organizing it into appropriate ranges and only then displaying the result." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "6SQO5PVLy8vt", + "outputId": "f0dbdf21-af94-4e56-cd82-938b7258c26f" + }, + "outputs": [], + "source": [ + "# Splitting benefits by comma and creating a list of benefits\n", + "benefits_list = df['salary_range'].str.split(',').explode()\n", + "benefits_list = benefits_list[benefits_list != 'None']\n", + "benefits_list = benefits_list[benefits_list != '0-0']\n", + "\n", + "\n", + "# Counting the occurrences of each skill\n", + "benefits_count = benefits_list.str.strip().value_counts()\n", + "\n", + "# Plotting the top 10 most common benefits\n", + "top_benefits = benefits_count.head(10)\n", + "plt.figure(figsize=(10, 6))\n", + "top_benefits.sort_values().plot(kind='barh')\n", + "plt.title('Top 10 Salaries Range Offered by Companies')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Salary Range')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the location we have both county, state and city specified, so we need to split it into individual columns, and then show top 10 counties and cities." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_242StA_UZTF" + }, + "outputs": [], + "source": [ + "# Split the 'location' column into separate columns for country, state, and city\n", + "location_split = df['location'].str.split(', ', expand=True)\n", + "df['Country'] = location_split[0]\n", + "df['State'] = location_split[1]\n", + "df['City'] = location_split[2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 959 + }, + "id": "HS9SH6p9UaJU", + "outputId": "6562e31f-6719-448b-c290-1a9610eb50c2" + }, + "outputs": [], + "source": [ + "# Count the occurrences of unique values in the 'Country' column\n", + "Country_counts = df['Country'].value_counts()\n", + "\n", + "# Select the top 10 most frequent occurrences\n", + "top_10_Country = Country_counts.head(10)\n", + "\n", + "# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels\n", + "plt.figure(figsize=(14, 10))\n", + "sns.barplot(y=top_10_Country.index, x=top_10_Country.values)\n", + "plt.ylabel('Country')\n", + "plt.xlabel('Count')\n", + "plt.title('Top 10 Most Frequent Country')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 959 + }, + "id": "j_cPJl8pUcWT", + "outputId": "bb87ec2d-750d-45b0-f64f-ae4b84b00544" + }, + "outputs": [], + "source": [ + "# Count the occurrences of unique values in the 'City' column\n", + "City_counts = df['City'].value_counts()\n", + "\n", + "# Select the top 10 most frequent occurrences\n", + "top_10_City = City_counts.head(10)\n", + "\n", + "# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels\n", + "plt.figure(figsize=(14, 10))\n", + "sns.barplot(y=top_10_City.index, x=top_10_City.values)\n", + "plt.ylabel('City')\n", + "plt.xlabel('Count')\n", + "plt.title('Top 10 Most Frequent City')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-R8hkAIjVF_s" + }, + "source": [ + "### Fake job postings data visualization \n", + "\n", + "What about fraudulent class? Let see how many of the jobs in the dataset are fake. Whether there are equally true and false offers, or whether there is a significant disproportion between the two. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 651 + }, + "id": "KJ5Aq2IizZ4r", + "outputId": "e1c10006-9f5a-4321-d90a-28915e02f8c3" + }, + "outputs": [], + "source": [ + "## fake job visualization\n", + "# Count the occurrences of unique values in the 'fraudulent' column\n", + "fraudulent_counts = df['fraudulent'].value_counts()\n", + "\n", + "# Plot the counts using a rainbow color palette\n", + "plt.figure(figsize=(8, 6))\n", + "sns.barplot(x=fraudulent_counts.index, y=fraudulent_counts.values)\n", + "plt.xlabel('Fraudulent')\n", + "plt.ylabel('Count')\n", + "plt.title('Counts of Fraudulent vs Non-Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "oyeB2MFRVIWi", + "outputId": "9236f907-4c16-49f7-c14b-883d21ae6c2c" + }, + "outputs": [], + "source": [ + "plt.figure(figsize=(10, 6))\n", + "sns.countplot(data=df, x='employment_type', hue='fraudulent')\n", + "plt.title('Count of Fraudulent Cases by Employment Type')\n", + "plt.xlabel('Employment Type')\n", + "plt.ylabel('Count')\n", + "plt.legend(title='Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "ORGFxjVVVJBi", + "outputId": "084304de-5618-436a-8958-6f36abd72be7" + }, + "outputs": [], + "source": [ + "plt.figure(figsize=(10, 6))\n", + "sns.countplot(data=df, x='required_experience', hue='fraudulent')\n", + "plt.title('Count of Fraudulent Cases by Required Experience')\n", + "plt.xlabel('Required Experience')\n", + "plt.ylabel('Count')\n", + "plt.legend(title='Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "GnRPXpBWVL7O", + "outputId": "8d347181-83d8-44ae-9c57-88d98825694d" + }, + "outputs": [], + "source": [ + "plt.figure(figsize=(30, 18))\n", + "sns.countplot(data=df, x='required_education', hue='fraudulent')\n", + "plt.title('Count of Fraudulent Cases by Required Education')\n", + "plt.xlabel('Required Education')\n", + "plt.ylabel('Count')\n", + "plt.legend(title='Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8qKuYrkvVPlO" + }, + "source": [ + "We can see that there is no connection between those parameters and fake job postings. This way in the future processing we can remove them." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BbOwzXmdaJTw" + }, + "source": [ + "## Data cleaning/pre-processing\n", + "\n", + "One of the really important step related to any type of data processing is data cleaning. For texts it usually includes removal of stop words, special characters, numbers or any additional noise like hyperlinks. \n", + "\n", + "In our case, to prepare data for Fake Job Postings recognition we will first, combine all relevant columns into single new record and then clean the data to work on it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jYLwp2wSaMdi" + }, + "outputs": [], + "source": [ + "# List of columns to concatenate\n", + "columns_to_concat = ['title', 'location', 'department', 'salary_range', 'company_profile',\n", + " 'description', 'requirements', 'benefits', 'employment_type',\n", + " 'required_experience', 'required_education', 'industry', 'function']\n", + "\n", + "# Concatenate the values of specified columns into a new column 'job_posting'\n", + "df['job_posting'] = df[columns_to_concat].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)\n", + "\n", + "# Create a new DataFrame with columns 'job_posting' and 'fraudulent'\n", + "new_df = df[['job_posting', 'fraudulent']].copy()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "FulR3zMiaMgI", + "outputId": "995058f3-f5f7-4aec-e1e0-94d42aad468f" + }, + "outputs": [], + "source": [ + "new_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0TpoEx1-YgCs", + "outputId": "8eaaf021-ae66-477a-f07b-fd8a353d17eb", + "scrolled": true + }, + "outputs": [], + "source": [ + "# import spacy\n", + "import re\n", + "import nltk\n", + "from nltk.corpus import stopwords\n", + "\n", + "nltk.download('stopwords')\n", + "\n", + "def preprocess_text(text):\n", + " # Remove newlines, carriage returns, and tabs\n", + " text = re.sub('\\n','', text)\n", + " text = re.sub('\\r','', text)\n", + " text = re.sub('\\t','', text)\n", + " # Remove URLs\n", + " text = re.sub(r\"http\\S+|www\\S+|https\\S+\", \"\", text, flags=re.MULTILINE)\n", + "\n", + " # Remove special characters\n", + " text = re.sub(r\"[^a-zA-Z0-9\\s]\", \"\", text)\n", + "\n", + " # Remove punctuation\n", + " text = re.sub(r'[^\\w\\s]', '', text)\n", + "\n", + " # Remove digits\n", + " text = re.sub(r'\\d', '', text)\n", + "\n", + " # Convert to lowercase\n", + " text = text.lower()\n", + "\n", + " # Remove stop words\n", + " stop_words = set(stopwords.words('english'))\n", + " words = [word for word in text.split() if word.lower() not in stop_words]\n", + " text = ' '.join(words)\n", + "\n", + " return text\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "p9NHS6Vx2BE8" + }, + "outputs": [], + "source": [ + "new_df['job_posting'] = new_df['job_posting'].apply(preprocess_text)\n", + "\n", + "new_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The next step in the pre-processing is lemmatization. It is a process to reduce a word to its root form, called a lemma. For example the verb 'planning' would be changed to 'plan' world." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kZnHODi-ZK33" + }, + "outputs": [], + "source": [ + "# Lemmatization\n", + "import en_core_web_sm\n", + "\n", + "nlp = en_core_web_sm.load()\n", + "\n", + "def lemmatize_text(text):\n", + " doc = nlp(text)\n", + " return \" \".join([token.lemma_ for token in doc])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uQauVQdw2LWa" + }, + "outputs": [], + "source": [ + "new_df['job_posting'] = new_df['job_posting'].apply(lemmatize_text)\n", + "\n", + "new_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dQDR6_SpZW0B" + }, + "source": [ + "At this stage we can also visualize the data with wordcloud by having special text column. We can show it for both fake and real dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 411 + }, + "id": "fdR9GAG6ZnPh", + "outputId": "57e9b5ae-87b4-4523-d0ae-8c8fd56cd9bc" + }, + "outputs": [], + "source": [ + "from wordcloud import WordCloud\n", + "\n", + "non_fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 0]['job_posting'])\n", + "fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 1]['job_posting'])\n", + "\n", + "wordcloud_non_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(non_fraudulent_text)\n", + "\n", + "wordcloud_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(fraudulent_text)\n", + "\n", + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))\n", + "\n", + "ax1.imshow(wordcloud_non_fraudulent, interpolation='bilinear')\n", + "ax1.axis('off')\n", + "ax1.set_title('Non-Fraudulent Job Postings')\n", + "\n", + "ax2.imshow(wordcloud_fraudulent, interpolation='bilinear')\n", + "ax2.axis('off')\n", + "ax2.set_title('Fraudulent Job Postings')\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ihtfhOr7aNMa" + }, + "source": [ + "## Fake job postings identification and removal\n", + "\n", + "Nowadays, it is unfortunate that not all the job offers that are posted on papular portals are genuine. Some of them are created only to collect personal data. Therefore, just detecting fake job postings can be very essential. \n", + "\n", + "We will create bidirectional LSTM model with one hot encoding. Let's start with all necessary imports." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VNdX-xcjtVS2"
+ },
+ "outputs": [],
+ "source": [
+ "from tensorflow.keras.layers import Embedding\n",
+ "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
+ "from tensorflow.keras.models import Sequential\n",
+ "from tensorflow.keras.preprocessing.text import one_hot\n",
+ "from tensorflow.keras.layers import Dense\n",
+ "from tensorflow.keras.layers import Bidirectional\n",
+ "from tensorflow.keras.layers import Dropout"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Make sure you're using TensorFlow version 2.15.0."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "id": "IxY47-s7tbjU",
+ "outputId": "02d68552-ff52-422b-9044-e55e35ef1236"
+ },
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "tf.__version__"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let us import Intel® Extension for TensorFlow*. We are using the Python API `itex.experimental_ops_override()`. It automatically replaces some TensorFlow operators with custom operators under the `itex.ops` namespace, while staying compatible with existing trained parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import intel_extension_for_tensorflow as itex\n",
+ "\n",
+ "itex.experimental_ops_override()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We need to prepare the data for the model we will create. First let's assign the job postings to X and the fraudulent values to y (the expected output)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "U-7klPFFtZgo"
+ },
+ "outputs": [],
+ "source": [
+ "X = new_df['job_posting']\n",
+ "y = new_df['fraudulent']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One hot encoding is a technique to represent categorical variables as numerical values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "3FFtUrPbtbmD"
+ },
+ "outputs": [],
+ "source": [
+ "voc_size = 5000\n",
+ "onehot_repr = [one_hot(words, voc_size) for words in X]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "ygHx6LSg6ZUr",
+ "outputId": "5b152a4f-621b-400c-a65b-5fa19a934aa2"
+ },
+ "outputs": [],
+ "source": [
+ "sent_length = 40\n",
+ "embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)\n",
+ "print(embedded_docs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Creating model\n",
+ "\n",
+ "We are creating a Deep Neural Network using a Bidirectional LSTM. The architecture is as follows:\n",
+ "\n",
+ "* Embedding layer\n",
+ "* Bidirectional LSTM layer\n",
+ "* Dropout layer\n",
+ "* Dense layer with sigmoid activation\n",
+ "\n",
+ "We are using the Adam optimizer with binary cross-entropy loss and accuracy as the metric.\n",
+ "\n",
+ "If the Intel® Extension for TensorFlow* backend is XPU, `tf.keras.layers.LSTM` will be replaced by `itex.ops.ItexLSTM`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Vnhm4huG-Mat",
+ "outputId": "dbc59ef1-168a-4e11-f38d-b47674dd4be6"
+ },
+ "outputs": [],
+ "source": [
+ "embedding_vector_features = 50\n",
+ "model_itex = Sequential()\n",
+ "model_itex.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))\n",
+ "model_itex.add(Bidirectional(itex.ops.ItexLSTM(100)))\n",
+ "model_itex.add(Dropout(0.3))\n",
+ "model_itex.add(Dense(1, activation='sigmoid'))\n",
+ "model_itex.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
+ "print(model_itex.summary())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "1-tz3hyc-PvN"
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np  # needed to convert the padded sequences and labels to arrays\n",
+ "\n",
+ "X_final = np.array(embedded_docs)\n",
+ "y_final = np.array(y)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "POVN7X60-TnQ"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=320)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let's train the model. We are using the standard `model.fit()` method, providing the training and testing datasets. You can easily modify the number of epochs in this training process, but keep in mind that the model can become overfitted, so that it will have very good results on the training data but poor results on the test data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "U0cGa7ei-Ufh",
+ "outputId": "68ce942d-ea51-458f-ac6c-ab619ab1ce74"
+ },
+ "outputs": [],
+ "source": [
+ "model_itex.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1, batch_size=64)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The values returned by the model are in the range [0, 1], so we need to map them to integer values of 0 or 1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "u4I8Y-R5EcDw",
+ "outputId": "be384d88-b27c-49c5-bebb-e9bdba986692"
+ },
+ "outputs": [],
+ "source": [
+ "y_pred = (model_itex.predict(X_test) > 0.5).astype(\"int32\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To demonstrate the effectiveness of our model we present the confusion matrix and classification report available within the `scikit-learn` library."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 675 + }, + "id": "0lB3N6fxtbom", + "outputId": "97b1713d-b373-44e1-a5b2-15e41aa84016" + }, + "outputs": [], + "source": [ + "from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report\n", + "\n", + "conf_matrix = confusion_matrix(y_test, y_pred)\n", + "print(\"Confusion matrix:\")\n", + "print(conf_matrix)\n", + "\n", + "ConfusionMatrixDisplay.from_predictions(y_test, y_pred)\n", + "\n", + "class_report = classification_report(y_test, y_pred)\n", + "print(\"Classification report:\")\n", + "print(class_report)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ioa6oZNuaPnJ" + }, + "source": [ + "## Job recommendation by showing the most similar ones" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZGReO9ziJyXm" + }, + "source": [ + "Now, as we are sure that the data we are processing is real, we can get back to the original columns and create our recommendation system.\n", + "\n", + "Also use much more simple solution for recommendations. Even, as before we used Deep Learning to check if posting is fake, we can use classical machine learning algorithms to show similar job postings.\n", + "\n", + "First, let's filter fake job postings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 556 + }, + "id": "RsCZLWU0aMqN", + "outputId": "503c1b4e-26db-46fd-d69f-f8ee17c1519c" + }, + "outputs": [], + "source": [ + "real = df[df['fraudulent'] == 0]\n", + "real.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After that, we create a common column containing those text parameters that we want to be compared between theses and are relevant to us when making recommendations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "NLc-uoYeaMsy", + "outputId": "452602b0-88e2-4c9c-a6f0-5b069cc34009" + }, + "outputs": [], + "source": [ + "cols = ['title', 'description', 'requirements', 'required_experience', 'required_education', 'industry']\n", + "real = real[cols]\n", + "real.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 293 + }, + "id": "mX-xc2OetVzx", + "outputId": "e0f24240-d8dd-4f79-fda6-db15d2f4c54f" + }, + "outputs": [], + "source": [ + "real = real.fillna(value='')\n", + "real['text'] = real['description'] + real['requirements'] + real['required_experience'] + real['required_education'] + real['industry']\n", + "real.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see the mechanism that we will use to prepare recommendations - we will use sentence similarity based on prepared `text` column in our dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "\n", + "model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's prepare a few example sentences that cover 4 topics. 
On these sentences it will be easier to show how the similarities between the texts work than on the whole large dataset we have." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " # Smartphones\n", + " \"I like my phone\",\n", + " \"My phone is not good.\",\n", + " \"Your cellphone looks great.\",\n", + "\n", + " # Weather\n", + " \"Will it snow tomorrow?\",\n", + " \"Recently a lot of hurricanes have hit the US\",\n", + " \"Global warming is real\",\n", + "\n", + " # Food and health\n", + " \"An apple a day, keeps the doctors away\",\n", + " \"Eating strawberries is healthy\",\n", + " \"Is paleo better than keto?\",\n", + "\n", + " # Asking about age\n", + " \"How old are you?\",\n", + " \"what is your age?\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we are preparing functions to show similarities between given sentences in the for of heat map. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import seaborn as sns\n", + "\n", + "def plot_similarity(labels, features, rotation):\n", + " corr = np.inner(features, features)\n", + " sns.set(font_scale=1.2)\n", + " g = sns.heatmap(\n", + " corr,\n", + " xticklabels=labels,\n", + " yticklabels=labels,\n", + " vmin=0,\n", + " vmax=1,\n", + " cmap=\"YlOrRd\")\n", + " g.set_xticklabels(labels, rotation=rotation)\n", + " g.set_title(\"Semantic Textual Similarity\")\n", + "\n", + "def run_and_plot(messages_):\n", + " message_embeddings_ = model.encode(messages_)\n", + " plot_similarity(messages_, message_embeddings_, 90)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_and_plot(messages)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's move back to our job postings dataset. First, we are using sentence encoding model to be able to calculate similarities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "encodings = []\n", + "for text in real['text']:\n", + " encodings.append(model.encode(text))\n", + "\n", + "real['encodings'] = encodings" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, we can chose job posting we wan to calculate similarities to. In our case it is first job posting in the dataset, but you can easily change it to any other job posting, by changing value in the `index` variable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "index = 0\n", + "corr = np.inner(encodings[index], encodings)\n", + "real['corr_to_first'] = corr" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And based on the calculated similarities, we can show top most similar job postings, by sorting them according to calculated correlation value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "real.sort_values(by=['corr_to_first'], ascending=False).head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this code sample we created job recommendation system. First, we explored and analyzed the dataset, then we pre-process the data and create fake job postings detection model. 
At the end we used sentence similarities to show top 5 recommendations - the most similar job descriptions to the chosen one. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Tensorflow", + "language": "python", + "name": "tensorflow" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.py b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.py new file mode 100644 index 0000000000..425ab1f5dd --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.py @@ -0,0 +1,634 @@ +# %% [markdown] +# # Job recommendation system +# +# The code sample contains the following parts: +# +# 1. Data exploration and visualization +# 2. Data cleaning/pre-processing +# 3. Fake job postings identification and removal +# 4. Job recommendation by showing the most similar job postings +# +# The scenario is that someone wants to find the best posting for themselves. They have collected the data, but he is not sure if all the data is real. Therefore, based on a trained model, as in this sample, they identify with a high degree of accuracy which postings are real, and it is among them that they choose the best ad for themselves. +# +# For simplicity, only one dataset will be used within this code, but the process using one dataset is not significantly different from the one described earlier. +# + +# %% [markdown] +# ## Data exploration and visualization +# +# For the purpose of this code sample we will use Real or Fake: Fake Job Postings dataset available over HuggingFace API. In this first part we will focus on data exploration and visualization. In standard end-to-end workload it is the first step. Engineer needs to first know the data to be able to work on it and prepare solution that will utilize dataset the best. +# +# Lest start with loading the dataset. We are using datasets library to do that. + +# %% +from datasets import load_dataset + +dataset = load_dataset("victor/real-or-fake-fake-jobposting-prediction") +dataset = dataset['train'] + +# %% [markdown] +# To better analyze and understand the data we are transferring it to pandas DataFrame, so we are able to take benefit from all pandas data transformations. Pandas library provides multiple useful functions for data manipulation so it is usual choice at this stage of machine learning or deep learning project. +# + +# %% +import pandas as pd +df = dataset.to_pandas() + +# %% [markdown] +# Let's see 5 first and 5 last rows in the dataset we are working on. + +# %% +df.head() + +# %% +df.tail() + +# %% [markdown] +# Now, lets print a concise summary of the dataset. This way we will see all the column names, know the number of rows and types in every of the column. It is a great overview on the features of the dataset. + +# %% +df.info() + +# %% [markdown] +# At this point it is a good idea to make sure our dataset doen't contain any data duplication that could impact the results of our future system. 
To do that we first need to remove the `job_id` column. It contains a unique number for each job posting, so even if the rest of the data is the same between 2 postings it would make them look different.
+
+# %%
+# Drop the 'job_id' column
+df = df.drop(columns=['job_id'])
+df.head()
+
+# %% [markdown]
+# And now, the actual duplicates removal. We first print the number of duplicates in our dataset, then remove them using the `drop_duplicates` method, and after this operation print the number of duplicates again. If everything works as expected, after duplicates removal we should print `0` as the current number of duplicates in the dataset.
+
+# %%
+# let's make sure that there are no duplicated jobs
+
+print(df.duplicated().sum())
+df = df.drop_duplicates()
+print(df.duplicated().sum())
+
+# %% [markdown]
+# Now we can visualize the data from the dataset. First let's visualize the data as if it were all real, and later, for the purposes of fake data detection, we will also visualize it split into fake and real postings.
+#
+# When working with text data it can be challenging to visualize it. Thankfully, there is a `wordcloud` library that shows common words in the analyzed texts. The bigger a word is, the more often it appears in the text. Wordclouds allow us to quickly identify the most important topics and themes in a large text dataset and also explore patterns and trends in textual data.
+#
+# In our example, we will create a wordcloud for job titles to get a high-level overview of the job postings we are working with.
+
+# %%
+from wordcloud import WordCloud # module to print word cloud
+from matplotlib import pyplot as plt
+import seaborn as sns
+
+# On the basis of Job Titles form word cloud
+job_titles_text = ' '.join(df['title'])
+wordcloud = WordCloud(width=800, height=400, background_color='white').generate(job_titles_text)
+
+# Plotting Word Cloud
+plt.figure(figsize=(10, 6))
+plt.imshow(wordcloud, interpolation='bilinear')
+plt.title('Job Titles')
+plt.axis('off')
+plt.tight_layout()
+plt.show()
+
+# %% [markdown]
+# Another way to get some information from this type of dataset is to show the top-n most common values in a given column, or the distribution of the values in this column.
+# Let's show the top 10 most common job titles and compare this result with the previously shown wordcloud.
+
+# %%
+# Get Count of job title
+job_title_counts = df['title'].value_counts()
+
+# Plotting a bar chart for the top 10 most common job titles
+top_job_titles = job_title_counts.head(10)
+plt.figure(figsize=(10, 6))
+top_job_titles.sort_values().plot(kind='barh')
+plt.title('Top 10 Most Common Job Titles')
+plt.xlabel('Frequency')
+plt.ylabel('Job Titles')
+plt.show()
+
+# %% [markdown]
+# Now we can do the same for other columns, such as `employment_type`, `required_experience`, `telecommuting`, `has_company_logo` and `has_questions`. These should give us a really good overview of different parts of our dataset.
+ +# %% +# Count the occurrences of each work type +work_type_counts = df['employment_type'].value_counts() + +# Plotting the distribution of work types +plt.figure(figsize=(8, 6)) +work_type_counts.sort_values().plot(kind='barh') +plt.title('Distribution of Work Types Offered by Jobs') +plt.xlabel('Frequency') +plt.ylabel('Work Types') +plt.show() + +# %% +# Count the occurrences of required experience types +work_type_counts = df['required_experience'].value_counts() + +# Plotting the distribution of work types +plt.figure(figsize=(8, 6)) +work_type_counts.sort_values().plot(kind='barh') +plt.title('Distribution of Required Experience by Jobs') +plt.xlabel('Frequency') +plt.ylabel('Required Experience') +plt.show() + +# %% [markdown] +# For employment_type and required_experience we also created matrix to see if there is any corelation between those two. To visualize it we created heatmap. If you think that some of the parameters can be related, creating similar heatmap can be a good idea. + +# %% +from matplotlib import pyplot as plt +import seaborn as sns +import pandas as pd + +plt.subplots(figsize=(8, 8)) +df_2dhist = pd.DataFrame({ + x_label: grp['required_experience'].value_counts() + for x_label, grp in df.groupby('employment_type') +}) +sns.heatmap(df_2dhist, cmap='viridis') +plt.xlabel('employment_type') +_ = plt.ylabel('required_experience') + +# %% +# Count the occurrences of unique values in the 'telecommuting' column +telecommuting_counts = df['telecommuting'].value_counts() + +plt.figure(figsize=(8, 6)) +telecommuting_counts.sort_values().plot(kind='barh') +plt.title('Counts of telecommuting vs Non-telecommuting') +plt.xlabel('count') +plt.ylabel('telecommuting') +plt.show() + +# %% +has_company_logo_counts = df['has_company_logo'].value_counts() + +plt.figure(figsize=(8, 6)) +has_company_logo_counts.sort_values().plot(kind='barh') +plt.ylabel('has_company_logo') +plt.xlabel('Count') +plt.title('Counts of With_Logo vs Without_Logo') +plt.show() + +# %% +has_questions_counts = df['has_questions'].value_counts() + +# Plot the counts +plt.figure(figsize=(8, 6)) +has_questions_counts.sort_values().plot(kind='barh') +plt.ylabel('has_questions') +plt.xlabel('Count') +plt.title('Counts Questions vs NO_Questions') +plt.show() + +# %% [markdown] +# From the job recommendations point of view the salary and location can be really important parameters to take into consideration. In given dataset we have salary ranges available so there is no need for additional data processing rather than removal of empty ranges but if the dataset you're working on has specific values, consider organizing it into appropriate ranges and only then displaying the result. + +# %% +# Splitting benefits by comma and creating a list of benefits +benefits_list = df['salary_range'].str.split(',').explode() +benefits_list = benefits_list[benefits_list != 'None'] +benefits_list = benefits_list[benefits_list != '0-0'] + + +# Counting the occurrences of each skill +benefits_count = benefits_list.str.strip().value_counts() + +# Plotting the top 10 most common benefits +top_benefits = benefits_count.head(10) +plt.figure(figsize=(10, 6)) +top_benefits.sort_values().plot(kind='barh') +plt.title('Top 10 Salaries Range Offered by Companies') +plt.xlabel('Frequency') +plt.ylabel('Salary Range') +plt.show() + +# %% [markdown] +# For the location we have both county, state and city specified, so we need to split it into individual columns, and then show top 10 counties and cities. 
+ +# %% +# Split the 'location' column into separate columns for country, state, and city +location_split = df['location'].str.split(', ', expand=True) +df['Country'] = location_split[0] +df['State'] = location_split[1] +df['City'] = location_split[2] + +# %% +# Count the occurrences of unique values in the 'Country' column +Country_counts = df['Country'].value_counts() + +# Select the top 10 most frequent occurrences +top_10_Country = Country_counts.head(10) + +# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels +plt.figure(figsize=(14, 10)) +sns.barplot(y=top_10_Country.index, x=top_10_Country.values) +plt.ylabel('Country') +plt.xlabel('Count') +plt.title('Top 10 Most Frequent Country') +plt.show() + +# %% +# Count the occurrences of unique values in the 'City' column +City_counts = df['City'].value_counts() + +# Select the top 10 most frequent occurrences +top_10_City = City_counts.head(10) + +# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels +plt.figure(figsize=(14, 10)) +sns.barplot(y=top_10_City.index, x=top_10_City.values) +plt.ylabel('City') +plt.xlabel('Count') +plt.title('Top 10 Most Frequent City') +plt.show() + +# %% [markdown] +# ### Fake job postings data visualization +# +# What about fraudulent class? Let see how many of the jobs in the dataset are fake. Whether there are equally true and false offers, or whether there is a significant disproportion between the two. + +# %% +## fake job visualization +# Count the occurrences of unique values in the 'fraudulent' column +fraudulent_counts = df['fraudulent'].value_counts() + +# Plot the counts using a rainbow color palette +plt.figure(figsize=(8, 6)) +sns.barplot(x=fraudulent_counts.index, y=fraudulent_counts.values) +plt.xlabel('Fraudulent') +plt.ylabel('Count') +plt.title('Counts of Fraudulent vs Non-Fraudulent') +plt.show() + +# %% +plt.figure(figsize=(10, 6)) +sns.countplot(data=df, x='employment_type', hue='fraudulent') +plt.title('Count of Fraudulent Cases by Employment Type') +plt.xlabel('Employment Type') +plt.ylabel('Count') +plt.legend(title='Fraudulent') +plt.show() + +# %% +plt.figure(figsize=(10, 6)) +sns.countplot(data=df, x='required_experience', hue='fraudulent') +plt.title('Count of Fraudulent Cases by Required Experience') +plt.xlabel('Required Experience') +plt.ylabel('Count') +plt.legend(title='Fraudulent') +plt.show() + +# %% +plt.figure(figsize=(30, 18)) +sns.countplot(data=df, x='required_education', hue='fraudulent') +plt.title('Count of Fraudulent Cases by Required Education') +plt.xlabel('Required Education') +plt.ylabel('Count') +plt.legend(title='Fraudulent') +plt.show() + +# %% [markdown] +# We can see that there is no connection between those parameters and fake job postings. This way in the future processing we can remove them. + +# %% [markdown] +# ## Data cleaning/pre-processing +# +# One of the really important step related to any type of data processing is data cleaning. For texts it usually includes removal of stop words, special characters, numbers or any additional noise like hyperlinks. +# +# In our case, to prepare data for Fake Job Postings recognition we will first, combine all relevant columns into single new record and then clean the data to work on it. 
+ +# %% +# List of columns to concatenate +columns_to_concat = ['title', 'location', 'department', 'salary_range', 'company_profile', + 'description', 'requirements', 'benefits', 'employment_type', + 'required_experience', 'required_education', 'industry', 'function'] + +# Concatenate the values of specified columns into a new column 'job_posting' +df['job_posting'] = df[columns_to_concat].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1) + +# Create a new DataFrame with columns 'job_posting' and 'fraudulent' +new_df = df[['job_posting', 'fraudulent']].copy() + +# %% +new_df.head() + +# %% +# import spacy +import re +import nltk +from nltk.corpus import stopwords + +nltk.download('stopwords') + +def preprocess_text(text): + # Remove newlines, carriage returns, and tabs + text = re.sub('\n','', text) + text = re.sub('\r','', text) + text = re.sub('\t','', text) + # Remove URLs + text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE) + + # Remove special characters + text = re.sub(r"[^a-zA-Z0-9\s]", "", text) + + # Remove punctuation + text = re.sub(r'[^\w\s]', '', text) + + # Remove digits + text = re.sub(r'\d', '', text) + + # Convert to lowercase + text = text.lower() + + # Remove stop words + stop_words = set(stopwords.words('english')) + words = [word for word in text.split() if word.lower() not in stop_words] + text = ' '.join(words) + + return text + + + +# %% +new_df['job_posting'] = new_df['job_posting'].apply(preprocess_text) + +new_df.head() + +# %% [markdown] +# The next step in the pre-processing is lemmatization. It is a process to reduce a word to its root form, called a lemma. For example the verb 'planning' would be changed to 'plan' world. + +# %% +# Lemmatization +import en_core_web_sm + +nlp = en_core_web_sm.load() + +def lemmatize_text(text): + doc = nlp(text) + return " ".join([token.lemma_ for token in doc]) + +# %% +new_df['job_posting'] = new_df['job_posting'].apply(lemmatize_text) + +new_df.head() + +# %% [markdown] +# At this stage we can also visualize the data with wordcloud by having special text column. We can show it for both fake and real dataset. + +# %% +from wordcloud import WordCloud + +non_fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 0]['job_posting']) +fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 1]['job_posting']) + +wordcloud_non_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(non_fraudulent_text) + +wordcloud_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(fraudulent_text) + +fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10)) + +ax1.imshow(wordcloud_non_fraudulent, interpolation='bilinear') +ax1.axis('off') +ax1.set_title('Non-Fraudulent Job Postings') + +ax2.imshow(wordcloud_fraudulent, interpolation='bilinear') +ax2.axis('off') +ax2.set_title('Fraudulent Job Postings') + +plt.show() + +# %% [markdown] +# ## Fake job postings identification and removal +# +# Nowadays, it is unfortunate that not all the job offers that are posted on papular portals are genuine. Some of them are created only to collect personal data. Therefore, just detecting fake job postings can be very essential. +# +# We will create bidirectional LSTM model with one hot encoding. Let's start with all necessary imports. 
+
+# %%
+from tensorflow.keras.layers import Embedding
+from tensorflow.keras.preprocessing.sequence import pad_sequences
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.preprocessing.text import one_hot
+from tensorflow.keras.layers import Dense
+from tensorflow.keras.layers import Bidirectional
+from tensorflow.keras.layers import Dropout
+
+# %% [markdown]
+# Make sure you're using TensorFlow version 2.15.0.
+
+# %%
+import tensorflow as tf
+tf.__version__
+
+# %% [markdown]
+# Now, let us import Intel® Extension for TensorFlow*. We are using the Python API `itex.experimental_ops_override()`. It automatically replaces some TensorFlow operators with custom operators under the `itex.ops` namespace, while staying compatible with existing trained parameters.
+
+# %%
+import intel_extension_for_tensorflow as itex
+
+itex.experimental_ops_override()
+
+# %% [markdown]
+# We need to prepare the data for the model we will create. First let's assign the job postings to X and the fraudulent values to y (the expected output).
+
+# %%
+X = new_df['job_posting']
+y = new_df['fraudulent']
+
+# %% [markdown]
+# One hot encoding is a technique to represent categorical variables as numerical values.
+
+# %%
+voc_size = 5000
+onehot_repr = [one_hot(words, voc_size) for words in X]
+
+# %%
+sent_length = 40
+embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
+print(embedded_docs)
+
+# %% [markdown]
+# ### Creating model
+#
+# We are creating a Deep Neural Network using a Bidirectional LSTM. The architecture is as follows:
+#
+# * Embedding layer
+# * Bidirectional LSTM layer
+# * Dropout layer
+# * Dense layer with sigmoid activation
+#
+# We are using the Adam optimizer with binary cross-entropy loss and accuracy as the metric.
+#
+# If the Intel® Extension for TensorFlow* backend is XPU, `tf.keras.layers.LSTM` will be replaced by `itex.ops.ItexLSTM`.
+
+# %%
+embedding_vector_features = 50
+model_itex = Sequential()
+model_itex.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
+model_itex.add(Bidirectional(itex.ops.ItexLSTM(100)))
+model_itex.add(Dropout(0.3))
+model_itex.add(Dense(1, activation='sigmoid'))
+model_itex.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
+print(model_itex.summary())
+
+# %%
+import numpy as np  # needed to convert the padded sequences and labels to arrays
+
+X_final = np.array(embedded_docs)
+y_final = np.array(y)
+
+# %% [markdown]
+#
+
+# %%
+from sklearn.model_selection import train_test_split
+X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=320)
+
+# %% [markdown]
+# Now, let's train the model. We are using the standard `model.fit()` method, providing the training and testing datasets. You can easily modify the number of epochs in this training process, but keep in mind that the model can become overfitted, so that it will have very good results on the training data but poor results on the test data.
+
+# %%
+model_itex.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1, batch_size=64)
+
+# %% [markdown]
+# The values returned by the model are in the range [0, 1], so we need to map them to integer values of 0 or 1.
+
+# %%
+y_pred = (model_itex.predict(X_test) > 0.5).astype("int32")
+
+# %% [markdown]
+# To demonstrate the effectiveness of our model we present the confusion matrix and classification report available within the `scikit-learn` library.
+
+# %% [markdown]
+# ### Creating model
+#
+# We are creating a Deep Neural Network using a Bidirectional LSTM. The architecture is as follows:
+#
+# * Embedding layer
+# * Bidirectional LSTM layer
+# * Dropout layer
+# * Dense layer with sigmoid activation
+#
+# We are using the Adam optimizer with binary cross-entropy loss and tracking accuracy.
+#
+# If the Intel® Extension for TensorFlow* backend is XPU, `tf.keras.layers.LSTM` will be replaced by `itex.ops.ItexLSTM`.
+
+# %%
+embedding_vector_features = 50
+model_itex = Sequential()
+model_itex.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
+model_itex.add(Bidirectional(itex.ops.ItexLSTM(100)))
+model_itex.add(Dropout(0.3))
+model_itex.add(Dense(1, activation='sigmoid'))
+model_itex.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
+print(model_itex.summary())
+
+# %%
+import numpy as np
+
+X_final = np.array(embedded_docs)
+y_final = np.array(y)
+
+# %% [markdown]
+# Now we split the data into training and test sets.
+
+# %%
+from sklearn.model_selection import train_test_split
+X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=320)
+
+# %% [markdown]
+# Now, let's train the model. We are using the standard `model.fit()` method, providing the training and testing datasets. You can easily modify the number of epochs in this training process, but keep in mind that the model can become overtrained, so that it will have very good results on the training data but poor results on the test data.
+
+# %%
+model_itex.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1, batch_size=64)
+
+# %% [markdown]
+# The values returned by the model are in the range [0, 1], so we need to map them to integer values of 0 or 1.
+
+# %%
+y_pred = (model_itex.predict(X_test) > 0.5).astype("int32")
+
+# %% [markdown]
+# To demonstrate the effectiveness of our model we present the confusion matrix and classification report available within the `scikit-learn` library.
+
+# %%
+from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report
+
+conf_matrix = confusion_matrix(y_test, y_pred)
+print("Confusion matrix:")
+print(conf_matrix)
+
+ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
+
+class_report = classification_report(y_test, y_pred)
+print("Classification report:")
+print(class_report)
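+
+# %% [markdown]
+# As an optional, illustrative extra (not part of the original sample), the trained model can also score a single new posting text, reusing the same pre-processing and encoding steps as above. The example text is made up.
+
+# %%
+# Illustrative only: score one made-up posting with the trained model.
+sample_posting = "Work from home, no experience needed, earn money fast"
+cleaned = lemmatize_text(preprocess_text(sample_posting))
+encoded = pad_sequences([one_hot(cleaned, voc_size)], padding='pre', maxlen=sent_length)
+print("Probability that the posting is fake:", float(model_itex.predict(encoded)[0][0]))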
+
+# %% [markdown]
+# ## Job recommendation by showing the most similar ones
+
+# %% [markdown]
+# Now, as we are sure that the data we are processing is real, we can get back to the original columns and create our recommendation system.
+#
+# We will also use a much simpler approach for the recommendations. Even though we used Deep Learning to check whether a posting is fake, classical machine learning techniques are enough to find similar job postings.
+#
+# First, let's filter out the fake job postings.
+
+# %%
+real = df[df['fraudulent'] == 0]
+real.head()
+
+# %% [markdown]
+# After that, we create a common column containing the text fields that we want to compare between postings and that are relevant to us when making recommendations.
+
+# %%
+cols = ['title', 'description', 'requirements', 'required_experience', 'required_education', 'industry']
+real = real[cols]
+real.head()
+
+# %%
+real = real.fillna(value='')
+real['text'] = real['description'] + real['requirements'] + real['required_experience'] + real['required_education'] + real['industry']
+real.head()
+
+# %% [markdown]
+# Let's look at the mechanism that we will use to prepare recommendations - sentence similarity based on the prepared `text` column in our dataset.
+
+# %%
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
+
+# %% [markdown]
+# Let's prepare a few example sentences that cover 4 topics. On these sentences it will be easier to show how the similarities between texts work than on the whole large dataset we have.
+
+# %%
+messages = [
+    # Smartphones
+    "I like my phone",
+    "My phone is not good.",
+    "Your cellphone looks great.",
+
+    # Weather
+    "Will it snow tomorrow?",
+    "Recently a lot of hurricanes have hit the US",
+    "Global warming is real",
+
+    # Food and health
+    "An apple a day, keeps the doctors away",
+    "Eating strawberries is healthy",
+    "Is paleo better than keto?",
+
+    # Asking about age
+    "How old are you?",
+    "what is your age?",
+]
+
+# %% [markdown]
+# Now, we prepare functions to show the similarities between the given sentences in the form of a heat map.
+
+# %%
+import numpy as np
+import seaborn as sns
+
+def plot_similarity(labels, features, rotation):
+    corr = np.inner(features, features)
+    sns.set(font_scale=1.2)
+    g = sns.heatmap(
+        corr,
+        xticklabels=labels,
+        yticklabels=labels,
+        vmin=0,
+        vmax=1,
+        cmap="YlOrRd")
+    g.set_xticklabels(labels, rotation=rotation)
+    g.set_title("Semantic Textual Similarity")
+
+def run_and_plot(messages_):
+    message_embeddings_ = model.encode(messages_)
+    plot_similarity(messages_, message_embeddings_, 90)
+
+# %%
+run_and_plot(messages)
+
+# %% [markdown]
+# Now, let's move back to our job postings dataset. First, we use the sentence encoding model to be able to calculate similarities.
+
+# %%
+encodings = []
+for text in real['text']:
+    encodings.append(model.encode(text))
+
+real['encodings'] = encodings
+
+# %% [markdown]
+# Then, we can choose the job posting we want to calculate similarities to. In our case it is the first job posting in the dataset, but you can easily change it to any other job posting by changing the value of the `index` variable.
+
+# %%
+index = 0
+corr = np.inner(encodings[index], encodings)
+real['corr_to_first'] = corr
+
+# %% [markdown]
+# Based on the calculated similarities, we can show the most similar job postings by sorting them according to the calculated correlation value.
+
+# %%
+real.sort_values(by=['corr_to_first'], ascending=False).head()
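+
+# %% [markdown]
+# Optionally, the same lookup can be wrapped into a small helper function (an illustrative addition, not part of the original sample) that returns the top-n most similar postings for any row of the filtered DataFrame.
+
+# %%
+# Illustrative only: reusable top-n recommendation helper based on the
+# `encodings` column computed above.
+def recommend(index, top_n=5):
+    similarities = np.inner(real['encodings'].iloc[index], list(real['encodings']))
+    return real.assign(similarity=similarities).sort_values(by='similarity', ascending=False).head(top_n)
+
+recommend(0)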
+
+# %% [markdown]
+# In this code sample we created a job recommendation system. First, we explored and analyzed the dataset, then we pre-processed the data and created a fake job posting detection model. At the end we used sentence similarities to show the top 5 recommendations - the most similar job descriptions to the chosen one.
+
+# %%
+print("[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]")
+
+
diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/License.txt b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/License.txt
new file mode 100644
index 0000000000..e63c6e13dc
--- /dev/null
+++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/License.txt
@@ -0,0 +1,7 @@
+Copyright Intel Corporation
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/README.md b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/README.md
new file mode 100644
index 0000000000..6964819ee4
--- /dev/null
+++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/README.md
@@ -0,0 +1,177 @@
+# Job Recommendation System: End-to-End Deep Learning Workload
+
+
+This sample illustrates the use of Intel® Extension for TensorFlow* to build and run an end-to-end AI workload on the example of the job recommendation system.
+
+| Property | Description
+|:--- |:---
+| Category | Reference Designs and End to End
+| What you will learn | How to use Intel® Extension for TensorFlow* to build an end-to-end AI workload
+| Time to complete | 30 minutes
+
+## Purpose
+
+This code sample shows an end-to-end Deep Learning workload using the example of a job recommendation system. It consists of four main parts:
+
+1. Data exploration and visualization - showing what the dataset looks like, what its main features are, and how the data is distributed.
+2. Data cleaning and pre-processing - removal of duplicates and an explanation of all necessary steps for text pre-processing.
+3. Fake job postings identification and removal - finding which of the job postings are fake using an LSTM DNN and filtering them out.
+4. Job recommendation - calculating and showing the top-n job descriptions most similar to the chosen one.
+
+## Prerequisites
+
+| Optimized for | Description
+| :--- | :---
+| OS | Linux, Ubuntu* 20.04
+| Hardware | GPU
+| Software | Intel® Extension for TensorFlow*
+> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation).
+
+
+## Key Implementation Details
+
+This sample creates a Deep Neural Network for fake job posting detection using the Intel® Extension for TensorFlow* LSTM layer on GPU. It also utilizes `itex.experimental_ops_override()` to automatically replace some TensorFlow operators with Custom Operators from Intel® Extension for TensorFlow*.
+
+The sample tutorial contains one Jupyter Notebook and one Python script. You can use either.
+
+## Environment Setup
+You will need to download and install the following toolkits, tools, and components to use the sample.
+
+
+**1. Get AI Tools**
+
+Required AI Tools: 
+
+If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector.
+
+>**Note**: If Docker option is chosen in AI Tools Selector, refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples.
+
+**2. (Offline Installer) Activate the AI Tools bundle base environment**
+
+If the default path is used during the installation of AI Tools:
+```
+source $HOME/intel/oneapi/intelpython/bin/activate
+```
+If a non-default path is used:
+```
+source /bin/activate
+```
+
+**3. (Offline Installer) Activate relevant Conda environment**
+
+```
+conda activate tensorflow-gpu
+```
+
+**4. Clone the GitHub repository**
+
+
+```
+git clone https://github.com/oneapi-src/oneAPI-samples.git
+cd oneAPI-samples/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem
+```
+
+**5. Install dependencies**
+
+>**Note**: Before running the following commands, make sure your Conda/Python environment with AI Tools installed is activated.
+
+```
+pip install -r requirements.txt
+pip install notebook
+```
+For Jupyter Notebook, refer to [Installing Jupyter](https://jupyter.org/install) for detailed installation instructions.
+
+## Run the Sample
+>**Note**: Before running the sample, make sure [Environment Setup](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/INC-Quantization-Sample-for-PyTorch#environment-setup) is completed.
+
+Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions:
+* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated)
+* [Conda/PIP](#condapip)
+* [Docker](#docker)
+
+### AI Tools Offline Installer (Validated)
+
+**1. Register Conda kernel to Jupyter Notebook kernel**
+
+If the default path is used during the installation of AI Tools:
+```
+$HOME/intel/oneapi/intelpython/envs/tensorflow-gpu/bin/python -m ipykernel install --user --name=tensorflow-gpu
+```
+If a non-default path is used:
+```
+/bin/python -m ipykernel install --user --name=tensorflow-gpu
+```
+**2. Launch Jupyter Notebook**
+
+```
+jupyter notebook --ip=0.0.0.0
+```
+**3. Follow the instructions to open the URL with the token in your browser**
+
+**4. Select the Notebook**
+
+```
+JobRecommendationSystem.ipynb
+```
+**5. Change the kernel to `tensorflow-gpu`**
+
+**6. Run every cell in the Notebook in sequence**
+
+### Conda/PIP
+> **Note**: Before running the instructions below, make sure your Conda/Python environment with AI Tools installed is activated
+
+**1. Register Conda/Python kernel to Jupyter Notebook kernel**
+
+For Conda:
+```
+/bin/python -m ipykernel install --user --name=tensorflow-gpu
+```
+To find your Conda environment path, run `conda env list`.
+
+For PIP:
+```
+python -m ipykernel install --user --name=tensorflow-gpu
+```
+**2. Launch Jupyter Notebook**
+
+```
+jupyter notebook --ip=0.0.0.0
+```
+**3. Follow the instructions to open the URL with the token in your browser**
+
+**4. Select the Notebook**
+
+```
+JobRecommendationSystem.ipynb
+```
+**5. Change the kernel to ``**
+
+
+**6. Run every cell in the Notebook in sequence**
+
+### Docker
+AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples.
+
+
+
+## Example Output
+
+ If successful, the sample displays [CODE_SAMPLE_COMPLETED_SUCCESSFULLY]. Additionally, the sample shows multiple diagrams explaining the dataset, the training progress for fake job posting detection, and the top job recommendations.
+
+## Related Samples
+
+
+* [Intel Extension For TensorFlow Getting Started Sample](https://github.com/oneapi-src/oneAPI-samples/blob/development/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/README.md)
+* [Leveraging Intel Extension for TensorFlow with LSTM for Text Generation Sample](https://github.com/oneapi-src/oneAPI-samples/blob/master/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_TextGeneration_with_LSTM/README.md)
+
+## License
+
+Code samples are licensed under the MIT license. See
+[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt)
+for details.
+
+Third party program Licenses can be found here:
+[third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt)
+
+*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/requirements.txt b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/requirements.txt new file mode 100644 index 0000000000..15bcd710c6 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/requirements.txt @@ -0,0 +1,10 @@ +ipykernel +matplotlib +sentence_transformers +transformers +datasets +accelerate +wordcloud +spacy +jinja2 +nltk \ No newline at end of file diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/sample.json b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/sample.json new file mode 100644 index 0000000000..31e14cab36 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/sample.json @@ -0,0 +1,29 @@ +{ + "guid": "80708728-0BD4-435E-961D-178E5ED1450C", + "name": "JobRecommendationSystem: End-to-End Deep Learning Workload", + "categories": ["Toolkit/oneAPI AI And Analytics/End-to-End Workloads"], + "description": "This sample illustrates the use of Intel® Extension for TensorFlow* to build and run an end-to-end AI workload on the example of the job recommendation system", + "builder": ["cli"], + "toolchain": ["jupyter"], + "languages": [{"python":{}}], + "os":["linux"], + "targetDevice": ["GPU"], + "ciTests": { + "linux": [ + { + "env": [], + "id": "JobRecommendationSystem_py", + "steps": [ + "source /intel/oneapi/intelpython/bin/activate", + "conda env remove -n user_tensorflow-gpu", + "conda create --name user_tensorflow-gpu --clone tensorflow-gpu", + "conda activate user_tensorflow-gpu", + "pip install -r requirements.txt", + "python -m ipykernel install --user --name=user_tensorflow-gpu", + "python JobRecommendationSystem.py" + ] + } + ] +}, +"expertise": "Reference Designs and End to End" +} \ No newline at end of file diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/third-party-programs.txt b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/third-party-programs.txt new file mode 100644 index 0000000000..e9f8042d0a --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/third-party-programs.txt @@ -0,0 +1,253 @@ +oneAPI Code Samples - Third Party Programs File + +This file contains the list of third party software ("third party programs") +contained in the Intel software and their required notices and/or license +terms. This third party software, even if included with the distribution of the +Intel software, may be governed by separate license terms, including without +limitation, third party license terms, other Intel software license terms, and +open source software license terms. These separate license terms govern your use +of the third party programs as set forth in the “third-party-programs.txt” or +other similarly named text file. + +Third party programs and their corresponding required notices and/or license +terms are listed below. + +-------------------------------------------------------------------------------- + +1. Nothings STB Libraries + +stb/LICENSE + + This software is available under 2 licenses -- choose whichever you prefer. 
+ ------------------------------------------------------------------------------ + ALTERNATIVE A - MIT License + Copyright (c) 2017 Sean Barrett + Permission is hereby granted, free of charge, to any person obtaining a copy of + this software and associated documentation files (the "Software"), to deal in + the Software without restriction, including without limitation the rights to + use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies + of the Software, and to permit persons to whom the Software is furnished to do + so, subject to the following conditions: + The above copyright notice and this permission notice shall be included in all + copies or substantial portions of the Software. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + ------------------------------------------------------------------------------ + ALTERNATIVE B - Public Domain (www.unlicense.org) + This is free and unencumbered software released into the public domain. + Anyone is free to copy, modify, publish, use, compile, sell, or distribute this + software, either in source code form or as a compiled binary, for any purpose, + commercial or non-commercial, and by any means. + In jurisdictions that recognize copyright laws, the author or authors of this + software dedicate any and all copyright interest in the software to the public + domain. We make this dedication for the benefit of the public at large and to + the detriment of our heirs and successors. We intend this dedication to be an + overt act of relinquishment in perpetuity of all present and future rights to + this software under copyright law. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION + WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +-------------------------------------------------------------------------------- + +2. FGPA example designs-gzip + + SDL2.0 + +zlib License + + + This software is provided 'as-is', without any express or implied + warranty. In no event will the authors be held liable for any damages + arising from the use of this software. + + Permission is granted to anyone to use this software for any purpose, + including commercial applications, and to alter it and redistribute it + freely, subject to the following restrictions: + + 1. The origin of this software must not be misrepresented; you must not + claim that you wrote the original software. If you use this software + in a product, an acknowledgment in the product documentation would be + appreciated but is not required. + 2. Altered source versions must be plainly marked as such, and must not be + misrepresented as being the original software. + 3. This notice may not be removed or altered from any source distribution. 
+ + +-------------------------------------------------------------------------------- + +3. Nbody + (c) 2019 Fabio Baruffa + + Plotly.js + Copyright (c) 2020 Plotly, Inc + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. +© 2020 GitHub, Inc. + +-------------------------------------------------------------------------------- + +4. GNU-EFI + Copyright (c) 1998-2000 Intel Corporation + +The files in the "lib" and "inc" subdirectories are using the EFI Application +Toolkit distributed by Intel at http://developer.intel.com/technology/efi + +This code is covered by the following agreement: + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + +Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. THE EFI SPECIFICATION AND ALL OTHER INFORMATION +ON THIS WEB SITE ARE PROVIDED "AS IS" WITH NO WARRANTIES, AND ARE SUBJECT +TO CHANGE WITHOUT NOTICE. + +-------------------------------------------------------------------------------- + +5. Edk2 + Copyright (c) 2019, Intel Corporation. All rights reserved. + + Edk2 Basetools + Copyright (c) 2019, Intel Corporation. All rights reserved. + +SPDX-License-Identifier: BSD-2-Clause-Patent + +-------------------------------------------------------------------------------- + +6. Heat Transmission + +GNU LESSER GENERAL PUBLIC LICENSE +Version 3, 29 June 2007 + +Copyright © 2007 Free Software Foundation, Inc. + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 
+ +This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. + +0. Additional Definitions. +As used herein, “this License” refers to version 3 of the GNU Lesser General Public License, and the “GNU GPL” refers to version 3 of the GNU General Public License. + +“The Library” refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. + +An “Application” is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. + +A “Combined Work” is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the “Linked Version”. + +The “Minimal Corresponding Source” for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. + +The “Corresponding Application Code” for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. + +1. Exception to Section 3 of the GNU GPL. +You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. + +2. Conveying Modified Versions. +If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: + +a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or +b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. +3. Object Code Incorporating Material from Library Header Files. +The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: + +a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. +b) Accompany the object code with a copy of the GNU GPL and this license document. +4. Combined Works. +You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: + +a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. 
+b) Accompany the Combined Work with a copy of the GNU GPL and this license document. +c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. +d) Do one of the following: +0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. +1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. +e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) +5. Combined Libraries. +You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: + +a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. +b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. +6. Revised Versions of the GNU Lesser General Public License. +The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License “or any later version” applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. 
+ +If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. + +-------------------------------------------------------------------------------- +7. Rodinia + Copyright (c)2008-2011 University of Virginia +All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, are permitted without royalty fees or other restrictions, provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + * Neither the name of the University of Virginia, the Dept. of Computer Science, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF VIRGINIA OR THE SOFTWARE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +If you use this software or a modified version of it, please cite the most relevant among the following papers: + + - M. A. Goodrum, M. J. Trotter, A. Aksel, S. T. Acton, and K. Skadron. Parallelization of Particle Filter Algorithms. In Proceedings of the 3rd Workshop on Emerging Applications and Many-core Architecture (EAMA), in conjunction with the IEEE/ACM International +Symposium on Computer Architecture (ISCA), June 2010. + + - S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, Sang-Ha Lee and K. Skadron. +Rodinia: A Benchmark Suite for Heterogeneous Computing. IEEE International Symposium +on Workload Characterization, Oct 2009. + +- J. Meng and K. Skadron. "Performance Modeling and Automatic Ghost Zone Optimization +for Iterative Stencil Loops on GPUs." In Proceedings of the 23rd Annual ACM International +Conference on Supercomputing (ICS), June 2009. + +- L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems +Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International +Symposium on Computer Architecture (ISCA), June 2009. + +- M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: +A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel +and Distributed Processing Symposium (IPDPS), May 2009. + +- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. 
"A Performance +Study of General Purpose Applications on Graphics Processors using CUDA" Journal of +Parallel and Distributed Computing, Elsevier, June 2008. + +-------------------------------------------------------------------------------- +Other names and brands may be claimed as the property of others. + +-------------------------------------------------------------------------------- \ No newline at end of file diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/.gitkeep b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/License.txt b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/License.txt new file mode 100644 index 0000000000..e63c6e13dc --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/License.txt @@ -0,0 +1,7 @@ +Copyright Intel Corporation + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/README.md new file mode 100644 index 0000000000..a8fb984dd9 --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/README.md @@ -0,0 +1,140 @@ +# `JAX Getting Started` Sample + +The `JAX Getting Started` sample demonstrates how to train a JAX model and run inference on Intel® hardware. +| Property | Description +|:--- |:--- +| Category | Get Start Sample +| What you will learn | How to start using JAX* on Intel® hardware. +| Time to complete | 10 minutes + +## Purpose + +JAX is a high-performance numerical computing library that enables automatic differentiation. It provides features like just-in-time compilation and efficient parallelization for machine learning and scientific computing tasks. + +This sample code shows how to get started with JAX on CPU. The sample code defines a simple neural network that trains on the MNIST dataset using JAX for parallel computations across multiple CPU cores. The network trains over multiple epochs, evaluates accuracy, and adjusts parameters using stochastic gradient descent across devices. + +## Prerequisites + +| Optimized for | Description +|:--- |:--- +| OS | Ubuntu* 22.0.4 and newer +| Hardware | Intel® Xeon® Scalable processor family +| Software | JAX + +> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. 
For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation).
+
+## Key Implementation Details
+
+The getting-started sample code uses the Python file 'spmd_mnist_classifier_fromscratch.py' under the examples directory in the
+[jax repository](https://github.com/google/jax/).
+It implements the training and inference of a simple neural network on MNIST images. The images are downloaded to a temporary directory when the example is run for the first time.
+- **init_random_params** initializes the neural network weights and biases for each layer.
+- **predict** computes the forward pass of the network, applying weights, biases, and activations to inputs.
+- **loss** calculates the cross-entropy loss between predictions and target labels.
+- **spmd_update** performs parallel gradient updates across multiple devices using JAX’s pmap and lax.psum (a minimal sketch of this pattern follows the list below).
+- **accuracy** computes the accuracy of the model by predicting the class of each input in the batch and comparing it to the true target class. It uses the *jnp.argmax* function to find the predicted class and then computes the mean of correct predictions.
+- **data_stream** generates batches of shuffled training data. It reshapes the data so that it can be split across multiple cores, ensuring that the batch size is divisible by the number of cores for parallel processing.
+- **training loop** trains the model for a set number of epochs, updating parameters and printing training/test accuracy after each epoch. The parameters are replicated across devices and updated in parallel using spmd_update. After each epoch, the model’s accuracy is evaluated on both training and test data using accuracy.
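+
+The following minimal sketch (illustrative only, not taken from the sample) shows the pmap + psum pattern that spmd_update relies on: each device computes a local value and `lax.psum` combines them so every device sees the same total.
+
+```
+import jax
+import jax.numpy as jnp
+from jax import lax
+
+# One value per local device (on a plain CPU install this is usually 1).
+n_devices = jax.local_device_count()
+values = jnp.arange(n_devices, dtype=jnp.float32)
+
+# pmap runs the function once per device; psum sums the per-device values
+# across all devices, so every device ends up holding the same total.
+total = jax.pmap(lambda v: lax.psum(v, axis_name='devices'), axis_name='devices')(values)
+print(total)
+```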
+
+## Environment Setup
+
+You will need to download and install the following toolkits, tools, and components to use the sample.
+
+**1. Get Intel® AI Tools**
+
+Required AI Tools: 'JAX'
+
+If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select Offline Installer option in AI Tools Selector.
+please see the [supported versions](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). + +>**Note**: If Docker option is chosen in AI Tools Selector, refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples. + +**2. (Offline Installer) Activate the AI Tools bundle base environment** + +If the default path is used during the installation of AI Tools: +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If a non-default path is used: +``` +source /bin/activate +``` + +**3. (Offline Installer) Activate relevant Conda environment** + +For the system with Intel CPU: +``` +conda activate jax +``` + +**4. Clone the GitHub repository** +``` +git clone https://github.com/google/jax.git +cd jax +export PYTHONPATH=$PYTHONPATH:$(pwd) +``` +## Run the Sample + +>**Note**: Before running the sample, make sure Environment Setup is completed. +Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)/Conda/PIP](#ai-tools-offline-installer-validatedcondapip) +* [Docker](#docker) +### AI Tools Offline Installer (Validated)/Conda/PIP +``` + python examples/spmd_mnist_classifier_fromscratch.py +``` +### Docker +AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples. +## Example Output +1. When the program is run, you should see results similar to the following: + +``` +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz to /tmp/jax_example_data/ +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz to /tmp/jax_example_data/ +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz to /tmp/jax_example_data/ +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz to /tmp/jax_example_data/ +Epoch 0 in 2.71 sec +Training set accuracy 0.7381166815757751 +Test set accuracy 0.7516999840736389 +Epoch 1 in 2.35 sec +Training set accuracy 0.81454998254776 +Test set accuracy 0.8277999758720398 +Epoch 2 in 2.33 sec +Training set accuracy 0.8448166847229004 +Test set accuracy 0.8568999767303467 +Epoch 3 in 2.34 sec +Training set accuracy 0.8626833558082581 +Test set accuracy 0.8715999722480774 +Epoch 4 in 2.30 sec +Training set accuracy 0.8752999901771545 +Test set accuracy 0.8816999793052673 +Epoch 5 in 2.33 sec +Training set accuracy 0.8839333653450012 +Test set accuracy 0.8899999856948853 +Epoch 6 in 2.37 sec +Training set accuracy 0.8908833265304565 +Test set accuracy 0.8944999575614929 +Epoch 7 in 2.31 sec +Training set accuracy 0.8964999914169312 +Test set accuracy 0.8986999988555908 +Epoch 8 in 2.28 sec +Training set accuracy 0.9016000032424927 +Test set accuracy 0.9034000039100647 +Epoch 9 in 2.31 sec +Training set accuracy 0.9060333371162415 +Test set accuracy 0.9059999585151672 +``` + +2. Troubleshooting + + If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. 
See the *[Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html)* for more information on using the utility + +## License + +Code samples are licensed under the MIT license. See +[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) +for details. + +Third party program Licenses can be found here: +[third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt) + +*Other names and brands may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/run.sh b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/run.sh new file mode 100644 index 0000000000..2a8313d002 --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/run.sh @@ -0,0 +1,6 @@ +source $HOME/intel/oneapi/intelpython/bin/activate +conda activate jax +git clone https://github.com/google/jax.git +cd jax +export PYTHONPATH=$PYTHONPATH:$(pwd) +python examples/spmd_mnist_classifier_fromscratch.py diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/sample.json b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/sample.json new file mode 100644 index 0000000000..96c1fffd5b --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/sample.json @@ -0,0 +1,24 @@ +{ + "guid": "9A6A140B-FBD0-4CB2-849A-9CAF15A6F3B1", + "name": "Getting Started example for JAX CPU", + "categories": ["Toolkit/oneAPI AI And Analytics/Getting Started"], + "description": "This sample illustrates how to train a JAX model and run inference", + "builder": ["cli"], + "languages": [{ + "python": {} + }], + "os": ["linux"], + "targetDevice": ["CPU"], + "ciTests": { + "linux": [{ + "id": "JAX CPU example", + "steps": [ + "git clone https://github.com/google/jax.git", + "cd jax", + "conda activate jax", + "python examples/spmd_mnist_classifier_fromscratch.py" + ] + }] + }, + "expertise": "Getting Started" +} diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/third-party-programs.txt b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/third-party-programs.txt new file mode 100644 index 0000000000..e9f8042d0a --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/third-party-programs.txt @@ -0,0 +1,253 @@ +oneAPI Code Samples - Third Party Programs File + +This file contains the list of third party software ("third party programs") +contained in the Intel software and their required notices and/or license +terms. This third party software, even if included with the distribution of the +Intel software, may be governed by separate license terms, including without +limitation, third party license terms, other Intel software license terms, and +open source software license terms. These separate license terms govern your use +of the third party programs as set forth in the “third-party-programs.txt” or +other similarly named text file. + +Third party programs and their corresponding required notices and/or license +terms are listed below. + +-------------------------------------------------------------------------------- + +1. Nothings STB Libraries + +stb/LICENSE + + This software is available under 2 licenses -- choose whichever you prefer. 
+ ------------------------------------------------------------------------------ + ALTERNATIVE A - MIT License + Copyright (c) 2017 Sean Barrett + Permission is hereby granted, free of charge, to any person obtaining a copy of + this software and associated documentation files (the "Software"), to deal in + the Software without restriction, including without limitation the rights to + use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies + of the Software, and to permit persons to whom the Software is furnished to do + so, subject to the following conditions: + The above copyright notice and this permission notice shall be included in all + copies or substantial portions of the Software. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + ------------------------------------------------------------------------------ + ALTERNATIVE B - Public Domain (www.unlicense.org) + This is free and unencumbered software released into the public domain. + Anyone is free to copy, modify, publish, use, compile, sell, or distribute this + software, either in source code form or as a compiled binary, for any purpose, + commercial or non-commercial, and by any means. + In jurisdictions that recognize copyright laws, the author or authors of this + software dedicate any and all copyright interest in the software to the public + domain. We make this dedication for the benefit of the public at large and to + the detriment of our heirs and successors. We intend this dedication to be an + overt act of relinquishment in perpetuity of all present and future rights to + this software under copyright law. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION + WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +-------------------------------------------------------------------------------- + +2. FGPA example designs-gzip + + SDL2.0 + +zlib License + + + This software is provided 'as-is', without any express or implied + warranty. In no event will the authors be held liable for any damages + arising from the use of this software. + + Permission is granted to anyone to use this software for any purpose, + including commercial applications, and to alter it and redistribute it + freely, subject to the following restrictions: + + 1. The origin of this software must not be misrepresented; you must not + claim that you wrote the original software. If you use this software + in a product, an acknowledgment in the product documentation would be + appreciated but is not required. + 2. Altered source versions must be plainly marked as such, and must not be + misrepresented as being the original software. + 3. This notice may not be removed or altered from any source distribution. 
+ + +-------------------------------------------------------------------------------- + +3. Nbody + (c) 2019 Fabio Baruffa + + Plotly.js + Copyright (c) 2020 Plotly, Inc + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. +© 2020 GitHub, Inc. + +-------------------------------------------------------------------------------- + +4. GNU-EFI + Copyright (c) 1998-2000 Intel Corporation + +The files in the "lib" and "inc" subdirectories are using the EFI Application +Toolkit distributed by Intel at http://developer.intel.com/technology/efi + +This code is covered by the following agreement: + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + +Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. THE EFI SPECIFICATION AND ALL OTHER INFORMATION +ON THIS WEB SITE ARE PROVIDED "AS IS" WITH NO WARRANTIES, AND ARE SUBJECT +TO CHANGE WITHOUT NOTICE. + +-------------------------------------------------------------------------------- + +5. Edk2 + Copyright (c) 2019, Intel Corporation. All rights reserved. + + Edk2 Basetools + Copyright (c) 2019, Intel Corporation. All rights reserved. + +SPDX-License-Identifier: BSD-2-Clause-Patent + +-------------------------------------------------------------------------------- + +6. Heat Transmission + +GNU LESSER GENERAL PUBLIC LICENSE +Version 3, 29 June 2007 + +Copyright © 2007 Free Software Foundation, Inc. + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 
+ +This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. + +0. Additional Definitions. +As used herein, “this License” refers to version 3 of the GNU Lesser General Public License, and the “GNU GPL” refers to version 3 of the GNU General Public License. + +“The Library” refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. + +An “Application” is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. + +A “Combined Work” is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the “Linked Version”. + +The “Minimal Corresponding Source” for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. + +The “Corresponding Application Code” for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. + +1. Exception to Section 3 of the GNU GPL. +You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. + +2. Conveying Modified Versions. +If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: + +a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or +b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. +3. Object Code Incorporating Material from Library Header Files. +The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: + +a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. +b) Accompany the object code with a copy of the GNU GPL and this license document. +4. Combined Works. +You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: + +a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. 
+b) Accompany the Combined Work with a copy of the GNU GPL and this license document. +c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. +d) Do one of the following: +0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. +1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. +e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) +5. Combined Libraries. +You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: + +a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. +b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. +6. Revised Versions of the GNU Lesser General Public License. +The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License “or any later version” applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. 
+ +If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. + +-------------------------------------------------------------------------------- +7. Rodinia + Copyright (c)2008-2011 University of Virginia +All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, are permitted without royalty fees or other restrictions, provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + * Neither the name of the University of Virginia, the Dept. of Computer Science, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF VIRGINIA OR THE SOFTWARE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +If you use this software or a modified version of it, please cite the most relevant among the following papers: + + - M. A. Goodrum, M. J. Trotter, A. Aksel, S. T. Acton, and K. Skadron. Parallelization of Particle Filter Algorithms. In Proceedings of the 3rd Workshop on Emerging Applications and Many-core Architecture (EAMA), in conjunction with the IEEE/ACM International +Symposium on Computer Architecture (ISCA), June 2010. + + - S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, Sang-Ha Lee and K. Skadron. +Rodinia: A Benchmark Suite for Heterogeneous Computing. IEEE International Symposium +on Workload Characterization, Oct 2009. + +- J. Meng and K. Skadron. "Performance Modeling and Automatic Ghost Zone Optimization +for Iterative Stencil Loops on GPUs." In Proceedings of the 23rd Annual ACM International +Conference on Supercomputing (ICS), June 2009. + +- L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems +Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International +Symposium on Computer Architecture (ISCA), June 2009. + +- M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: +A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel +and Distributed Processing Symposium (IPDPS), May 2009. + +- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. 
"A Performance +Study of General Purpose Applications on Graphics Processors using CUDA" Journal of +Parallel and Distributed Computing, Elsevier, June 2008. + +-------------------------------------------------------------------------------- +Other names and brands may be claimed as the property of others. + +-------------------------------------------------------------------------------- \ No newline at end of file diff --git a/AI-and-Analytics/Getting-Started-Samples/README.md b/AI-and-Analytics/Getting-Started-Samples/README.md index 4aa716713c..14154dc9fd 100644 --- a/AI-and-Analytics/Getting-Started-Samples/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/README.md @@ -27,5 +27,6 @@ Third party program Licenses can be found here: [third-party-programs.txt](https |Classical Machine Learning| Scikit-learn (OneDAL) | [Intel_Extension_For_SKLearn_GettingStarted](Intel_Extension_For_SKLearn_GettingStarted) | Speed up a scikit-learn application using Intel oneDAL. |Deep Learning
Inference Optimization|Intel® Extension of TensorFlow | [Intel® Extension For TensorFlow GettingStarted](Intel_Extension_For_TensorFlow_GettingStarted) | Guides users how to run a TensorFlow inference workload on both GPU and CPU. |Deep Learning Inference Optimization|oneCCL Bindings for PyTorch | [Intel oneCCL Bindings For PyTorch GettingStarted](Intel_oneCCL_Bindings_For_PyTorch_GettingStarted) | Guides users through the process of running a simple PyTorch* distributed workload on both GPU and CPU. | +|Inference Optimization|JAX Getting Started Sample | [IntelJAX GettingStarted](https://github.com/oneapi-src/oneAPI-samples/tree/development/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted) | The JAX Getting Started sample demonstrates how to train a JAX model and run inference on Intel® hardware. | *Other names and brands may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html)