kilimanj4r0/Internship_Provectus

Solutions for Data Engineering position on Provectus Internship

Prerequisites

  • Python 3.7 or greater
  • Docker 19.03 or greater
  • Git 2.28 or greater
  • Postgres 13 or greater

Level 1

File definitions:

  • src_data - Path containing the source data to be processed.
  • processed_data - Path for the processed output data.
  • user_id.jpg - User image file, for example, 0001.jpg. There may be several, one per user, in the source data path.
  • user_id.csv - User info file, for example, 0001.csv. There may be several, one per user, in the source data path.

The user CSV file contains the following columns:

  1. first_name - User first name
  2. last_name - User last name
  3. birthts - User birthdate as a UTC timestamp in milliseconds

Test CSV and image files can be found in the 02-src-data folder.

For example:

first_name, last_name, birthts
Ivan, Ivanov, 946674000000
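
Note that birthts is in milliseconds, so it has to be divided by 1000 before the usual epoch-seconds conversion. A quick sketch using only the Python standard library:

from datetime import datetime, timezone

# 946674000000 ms since the epoch -> 1999-12-31 21:00:00+00:00
born = datetime.fromtimestamp(946674000000 / 1000, tz=timezone.utc)
print(born)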

Data processing description

  1. Read the CSV file
  2. Match the image for each user
  3. Combine the data from the CSV with the image path
  4. Update the processed_data/output.csv file with the new data. Important: data for a previously processed user can be updated, and records must not be duplicated in the output CSV or in the DB. Output CSV file format: user_id, first_name, last_name, birthts, img_path. A minimal sketch of these steps is shown after this list.
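
The structure of the script is up to you; purely as an illustration, a minimal sketch of steps 1-4, assuming local src_data and processed_data folders and that pandas is available, might look like this:

from pathlib import Path

import pandas as pd

SRC_DATA = Path("src_data")
PROCESSED_DATA = Path("processed_data")
OUTPUT_CSV = PROCESSED_DATA / "output.csv"

def process_src_data() -> pd.DataFrame:
    rows = []
    for csv_path in SRC_DATA.glob("*.csv"):
        user_id = csv_path.stem                       # e.g. "0001"
        df = pd.read_csv(csv_path, skipinitialspace=True)
        df.columns = df.columns.str.strip()           # tolerate spaces after commas in the header
        user = df.iloc[0]
        img_path = SRC_DATA / f"{user_id}.jpg"        # image matched to the user by file name
        rows.append({
            "user_id": user_id,
            "first_name": user["first_name"],
            "last_name": user["last_name"],
            "birthts": user["birthts"],
            "img_path": str(img_path) if img_path.exists() else "",
        })
    new_data = pd.DataFrame(rows)

    # Merge with the existing output, keeping only the newest record per user_id
    if OUTPUT_CSV.exists():
        old_data = pd.read_csv(OUTPUT_CSV, dtype={"user_id": str})
        new_data = pd.concat([old_data, new_data]).drop_duplicates("user_id", keep="last")

    PROCESSED_DATA.mkdir(exist_ok=True)
    new_data.to_csv(OUTPUT_CSV, index=False)
    return new_data

if __name__ == "__main__":
    process_src_data()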

Task

Implement a script to process files from the src_data folder.

Results delivery format

The result should be a Python script together with demo data. A README.md file describing your solution should also be provided.

Level 2

The same as Level 1 with the following extras.

Results delivery format

The result should be implemented as a service that periodically reads and processes the source data. The service should also expose a web server with the following endpoints:

  • GET /data - return all records from the DB in JSON format. Filtering must be supported by is_image_exists = True/False and by user min_age and max_age in years.
  • POST /data - manually trigger data processing of src_data (a sketch of both endpoints is shown after this list)
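
The choice of web framework is open; purely as an illustration, a minimal sketch of these two endpoints assuming Flask and pandas, with the Level 1 routine imported from a hypothetical process module (the module, helper, and default port below are assumptions, not part of the task):

import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
from flask import Flask, jsonify, request

from process import process_src_data   # hypothetical module holding the Level 1 sketch

OUTPUT_CSV = Path("processed_data/output.csv")
app = Flask(__name__)

def age_in_years(birthts_ms: int) -> int:
    # birthts is a UTC timestamp in milliseconds
    born = datetime.fromtimestamp(birthts_ms / 1000, tz=timezone.utc)
    now = datetime.now(timezone.utc)
    return now.year - born.year - ((now.month, now.day) < (born.month, born.day))

@app.get("/data")
def get_data():
    df = pd.read_csv(OUTPUT_CSV, dtype={"user_id": str}).fillna({"img_path": ""})
    df["age"] = df["birthts"].apply(age_in_years)

    is_image_exists = request.args.get("is_image_exists")   # "True" / "False"
    min_age = request.args.get("min_age", type=int)
    max_age = request.args.get("max_age", type=int)

    if is_image_exists is not None:
        df = df[(df["img_path"] != "") == (is_image_exists.lower() == "true")]
    if min_age is not None:
        df = df[df["age"] >= min_age]
    if max_age is not None:
        df = df[df["age"] <= max_age]

    # Round-trip through to_json so numpy dtypes become plain JSON values
    return jsonify(json.loads(df.drop(columns=["age"]).to_json(orient="records")))

@app.post("/data")
def post_data():
    process_src_data()
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

The periodic processing itself is omitted here; it could be a simple background loop or a scheduler running process_src_data() at a fixed interval.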

A README.md file with a description of your solution should also be provided.

Level 3

The same as Level 2, but with the following differences.

File definitions:

Source data and processed data should be stored in Minio. The Minio service is already defined in the docker-compose file.
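
Reading from and writing to the buckets could look roughly like this; a minimal sketch with the minio Python client, where the endpoint, credentials, and bucket names are assumptions to be replaced with the values from docker-compose:

from minio import Minio

# Connection details are assumptions; take the real values from docker-compose
client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

SRC_BUCKET = "src-data"               # assumed bucket name for source data
PROCESSED_BUCKET = "processed-data"   # assumed bucket name for output data

# Download every object from the source bucket for local processing
for obj in client.list_objects(SRC_BUCKET, recursive=True):
    client.fget_object(SRC_BUCKET, obj.object_name, f"tmp/{obj.object_name}")

# Upload the rebuilt output.csv back to the processed bucket
client.fput_object(PROCESSED_BUCKET, "output.csv", "processed_data/output.csv")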

Data processing description

  1. Read the CSV file
  2. Match the image for each user
  3. Combine the data from the CSV with the image path
  4. Update the processed_data/output.csv file with the new data. Important: data for a previously processed user can be updated, and records must not be duplicated in the output CSV or in the DB. Output CSV file format: user_id, first_name, last_name, birthts, img_path
  5. Write the combined data to the DB. Each record should contain the following columns: id, user_id, first_name, last_name, birthdate, img_path, where id is an auto-incremented unique record id. The Postgres DB service is already defined in docker-compose. A sketch of this step is shown after this list.
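
Step 5 might be done along these lines; a minimal sketch assuming psycopg2, a users table with a unique constraint on user_id, and connection details matching docker-compose (the table name and credentials below are assumptions):

import psycopg2

# Connection parameters are assumptions; take the real values from docker-compose
conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="postgres", user="postgres", password="postgres",
)

with conn, conn.cursor() as cur:
    # id is auto-incremented; user_id is unique so a reprocessed user is updated, not duplicated
    cur.execute("""
        CREATE TABLE IF NOT EXISTS users (
            id         SERIAL PRIMARY KEY,
            user_id    TEXT UNIQUE NOT NULL,
            first_name TEXT,
            last_name  TEXT,
            birthdate  TIMESTAMPTZ,
            img_path   TEXT
        )
    """)
    cur.execute("""
        INSERT INTO users (user_id, first_name, last_name, birthdate, img_path)
        VALUES (%s, %s, %s, to_timestamp(%s / 1000.0), %s)
        ON CONFLICT (user_id) DO UPDATE SET
            first_name = EXCLUDED.first_name,
            last_name  = EXCLUDED.last_name,
            birthdate  = EXCLUDED.birthdate,
            img_path   = EXCLUDED.img_path
    """, ("0001", "Ivan", "Ivanov", 946674000000, "src_data/0001.jpg"))

conn.close()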

Results delivery format

The result should be implemented as a service that periodically reads and processes the source data. The service should also expose a web server with the following endpoints:

  • GET /data - return all records from the DB in JSON format. Filtering must be supported by is_image_exists = True/False and by user min_age and max_age in years.
  • POST /data - manually trigger data processing of src_data

The solution should run with docker-compose. The provided docker-compose file can be used as a base template.

Coding Tasks for Data Engineers

The following tasks cover different areas to check a candidate's basic knowledge of SQL, algorithms, and the Linux shell.

SQL

  1. Rewrite this SQL query without the subquery:
SELECT id
FROM users
WHERE id NOT IN (
	SELECT user_id
	FROM departments
	WHERE department_id = 1
);
  2. Write a SQL query to find all duplicate lastnames in a table named user:
+----+-----------+----------+
| id | firstname | lastname |
+----+-----------+----------+
| 1  | Ivan      | Sidorov  |
| 2  | Alexandr  | Ivanov   |
| 3  | Petr      | Petrov   |
| 4  | Stepan    | Ivanov   |
+----+-----------+----------+
  3. Write a SQL query to get the username from the user table with the second highest salary from the salary table. Show the username and its salary in the result.

salary table:
+---------+--------+
| user_id | salary |
+---------+--------+
| 1       | 1000   |
| 2       | 1100   |
| 3       | 900    |
| 4       | 1200   |
+---------+--------+

user table:
+----+----------+
| id | username |
+----+----------+
| 1  | Alex     |
| 2  | Maria    |
| 3  | Bob      |
| 4  | Sean     |
+----+----------+

Algorithms and Data Structures

  1. Optimise execution time of this Python code snippet:
def count_connections(list1: list, list2: list) -> int:
  count = 0
  
  for i in list1:
    for j in list2:
      if i == j:
        count += 1
  
  return count
  2. Given a string s, find the length of the longest substring without repeating characters. Analyze your solution and provide its space and time complexities.

Example 1

Input: s = "abcabcbb"
Output: 3
Explanation: The answer is "abc", with the length of 3.

Example 2

Input: s = "bbbbb"
Output: 1
Explanation: The answer is "b", with the length of 1.

Example 3

Input: s = "pwwkew"
Output: 3
Explanation: The answer is "wke", with the length of 3.
Notice that the answer must be a substring, "pwke" is a subsequence and not a substring.

Example 4

Input: s = ""
Output: 0
  3. Given a sorted array of distinct integers and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order.

Example:

Input: nums = [1,3,5,6], target = 5
Output: 2

Linux Shell

  1. List processes listening on ports 80 and 443
  2. List process environment variables by given PID
  3. Launch a Python program my_program.py from the CLI in the background. How would you close it after some period of time?
