Extracting the Full Text from ProQuest Newsstand Files to CSV

10/03/2024

To read a structured text file in R and extract only the full text to a CSV file, you can follow these steps:

Read the text file into R.
Use regular expressions to identify the start and end points of each article.
Extract the full text of each article.
Write the extracted text to a CSV file.

Here is example code that demonstrates these steps:

R
# Read the text file into R
file_path <- "path/to/your/file.txt"
text <- readLines(file_path, warn = FALSE)

# Collect one entry per article; extracted_text holds the article being read
articles <- character(0)
extracted_text <- ""
article_start <- FALSE

# Loop through each line of text
for (line in text) {
  # A line starting with "Volltext:" marks the start of an article
  if (grepl("^Volltext:", line)) {
    article_start <- TRUE
    extracted_text <- line  # Start with the first line of the article
  } else if (article_start) {
    # A line starting with "Unternehmen/Organisation:" marks the end of an article
    if (grepl("^Unternehmen/Organisation:", line)) {
      # Store the finished article and reset for the next one
      articles <- c(articles, extracted_text)
      extracted_text <- ""
      article_start <- FALSE
    } else {
      # Append the line to the current article
      extracted_text <- paste(extracted_text, line, sep = " ")
    }
  }
}

# Write all articles once, one row per article.
# Calling write.csv() inside the loop would overwrite the file on every
# article and keep only the last one.
write.csv(data.frame(Text = articles), "extracted_text.csv", row.names = FALSE)

Make sure to replace "path/to/your/file.txt" with the actual path to your text file. This code assumes that each article starts with a line beginning with “Volltext:” and ends at a line beginning with “Unternehmen/Organisation:”. Because both patterns are anchored with "^", a marker that appears in the middle of a line is not matched. Adjust the regular expressions ("^Volltext:" and "^Unternehmen/Organisation:") as needed based on the actual structure of your text file.
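Before running the extraction on a large export, it can help to check that the two patterns match where you expect. The following is a minimal sketch on a made-up sample: the sample lines, including the "Titel:" field, are hypothetical and only mimic the layout assumed above, not real ProQuest data.

```r
# Hypothetical sample that mimics the assumed export layout
sample_lines <- c(
  "Titel: Beispielartikel",
  "Volltext: Erster Absatz des Artikels.",
  "Zweiter Absatz des Artikels.",
  "Unternehmen/Organisation: Beispiel AG",
  "Titel: Noch ein Artikel",
  "Volltext: Einziger Absatz.",
  "Unternehmen/Organisation: Muster GmbH"
)

start_pattern <- "^Volltext:"
end_pattern <- "^Unternehmen/Organisation:"

# Line numbers where each marker matches
start_idx <- grep(start_pattern, sample_lines)
end_idx <- grep(end_pattern, sample_lines)

# Each article should contribute exactly one start and one end marker
marker_counts <- c(starts = length(start_idx), ends = length(end_idx))
print(marker_counts)

# Inspect the matched lines to confirm the patterns hit the right fields
print(sample_lines[start_idx])
print(sample_lines[end_idx])
```

If the two counts differ on your real file, some articles probably lack one of the fields, and the patterns or the pairing logic need adjusting before the output can be trusted.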

More information

You can use the tidyverse packages in R to simplify reading the file and extracting the data. The read_lines() function from the readr package reads the file, and functions from the stringr package can then be used to search the text and pull out the parts you need.

Here is how you can do it:

R
# Load the libraries
library(readr)
library(stringr)

# Read the file
file_path <- "path/to/your/file.txt"
text <- read_lines(file_path)

# Locate the start and end markers of each article
article_start <- which(str_detect(text, "^Volltext:"))
article_end <- which(str_detect(text, "^Unternehmen/Organisation:"))

# Pairing by position requires one end marker per start marker
stopifnot(length(article_start) == length(article_end))

articles <- lapply(seq_along(article_start), function(i) {
  start <- article_start[i]
  end <- article_end[i] - 1  # Stop before the end-marker line
  paste(text[start:end], collapse = "\n")
})

# Convert the data to a data frame
articles_df <- data.frame(Text = unlist(articles))

# Write the data to a CSV file
write_csv(articles_df, "extracted_text.csv")

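This code assumes that each article starts with a line beginning with “Volltext:” and is closed by a line beginning with “Unternehmen/Organisation:”, and that the markers come in matching pairs, so the i-th start marker and the i-th end marker belong to the same article. If an article is missing either marker, the pairing shifts and every later article is cut at the wrong line. Adjust the code to match the actual structure of your file.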
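If some articles in your export lack the end marker, pairing by position is fragile. A minimal sketch of a more tolerant approach follows: it pairs each start marker with the first end marker that appears before the next article begins, and falls back to the line before the next start marker (or the end of the file) when none is found. The helper name `extract_articles` and the sample lines are hypothetical, and the sketch uses base R only so it runs on any character vector, including the output of read_lines(). Unlike the code above, it also strips the "Volltext:" label from the first line, which is a design choice rather than a requirement.

```r
# Hypothetical helper: pair each start marker with the nearest following end marker
extract_articles <- function(lines,
                             start_pattern = "^Volltext:",
                             end_pattern = "^Unternehmen/Organisation:") {
  starts <- grep(start_pattern, lines)
  ends <- grep(end_pattern, lines)

  vapply(seq_along(starts), function(i) {
    s <- starts[i]
    # Upper bound: the line before the next start marker, or the last line
    limit <- if (i < length(starts)) starts[i + 1] - 1 else length(lines)
    # First end marker that falls inside this article's range, if any
    e <- ends[ends > s & ends <= limit]
    last <- if (length(e) > 0) e[1] - 1 else limit
    body <- lines[s:last]
    # Drop the field label from the first line of the article
    body[1] <- sub(start_pattern, "", body[1])
    trimws(paste(body, collapse = "\n"))
  }, character(1))
}

# Made-up sample: the second article has no end marker
sample_lines <- c(
  "Volltext: Erster Artikel.",
  "Fortsetzung.",
  "Unternehmen/Organisation: Beispiel AG",
  "Volltext: Zweiter Artikel ohne Endmarke."
)

articles <- extract_articles(sample_lines)
print(articles)
```

The result can be passed to data.frame(Text = articles) and written with write_csv() exactly as in the code above.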
